Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-16695: Improve expired poll logging #15909

Merged
merged 8 commits into from May 14, 2024

Conversation

lianetm
Copy link
Contributor

@lianetm lianetm commented May 9, 2024

Improve consumer log for expired poll timer, by showing how much time was the max.poll.interval.ms exceeded. This should be helpful in guiding the user to tune that config on the common case of long-running processing causing the consumer to leave the group. Inspired by other clients that log such information on the same situation.

@lianetm
Copy link
Contributor Author

lianetm commented May 9, 2024

Hey @mjsax , here is the improved logging following your suggestion, helpful indeed I expect. Would you have a chance to take a look? Thanks!

@mjsax mjsax added the consumer label May 9, 2024
Copy link
Member

@mjsax mjsax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! -- Need to wait for Jenkins to pass before merging. LGTM.

Copy link
Contributor

@ableegoldman ableegoldman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the idea of this, but wouldn't it be more useful to log the amount of time until the consumer polls next, not until the heartbeat thread polls

@@ -193,11 +193,12 @@ public NetworkClientDelegate.PollResult poll(long currentTimeMs) {
}
pollTimer.update(currentTimeMs);
if (pollTimer.isExpired() && !membershipManager.isLeavingGroup()) {
logger.warn("Consumer poll timeout has expired. This means the time between " +
logger.warn("Consumer poll timeout has expired, exceeded by {} ms. This means the time between " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is what gets logged when the heartbeat thread notices the consumer has failed to poll in time and dropped out of the group -- so the "time exceeded" is just going to be roughly the max poll interval + the heartbeat interval, no?

I do think it's a great idea to log the amount of time by which the max poll interval was exceeded, but imo the more useful information is how long after the max poll interval the consumer took to actually hit poll again, not how long the heartbeat thread took to notice it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I think that the background thread will notice more quickly than you said, but this just means the "time exceeded" is going to be very close to max poll interval. The heartbeat request manager checks to see whether it is time to send a heartbeat more regularly than it actually sends a heartbeat.

Maybe enhancing the logging in HeartbeatRequestManager.resetPollTimer would be a suitable point. This is where the heartbeat request manager will notice that it has already left the group because of delinquent polling, and rejoins when the next poll occurs. @lianetm that's probably workable I think.

Copy link
Contributor Author

@lianetm lianetm May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, good point, it would actually take this a step further, where indeed should be more useful. As @AndrewJSchofield pointed, the HB manager will notice sooner in practice (even sooner than the HB interval), but we do know when the next poll happens, so can definitely get a more accurate exceed time (in-between calls to poll, which translates to poll events handled in this same manager). On it...thanks for the comments!

@lianetm
Copy link
Contributor Author

lianetm commented May 10, 2024

Done, so I simplified what we log when the background thread realizes time's up and leaves the group to rejoin eventually (that's all the relevant info at that point). I then moved the log that details the expired max.poll.interval to the place where we can give a more accurate exceeded time, which is on the next app poll event that the background handles. Also updated the test to make sure it checks not only how the exceed time is calculated, but also where it is calculated. Makes sense? More accurate now indeed, thanks!

Copy link
Contributor

@AndrewJSchofield AndrewJSchofield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated PR is definitely better.

Because we have the application thread and the background thread running concurrently, and the application thread waiting in a long poll(Duration) actually polls internally, what we are measuring here is the time since the last of these internal polls, which will be approximately the end of the application's latest call to poll(Duration). I think that's going to be good enough for this purpose, helping the user understand whether they need to increase max.poll.interval.ms.

if (pollTimer.isExpired()) {
logger.debug("Poll timer has been reset after it had expired");
logger.warn("Time between subsequent calls to poll() was longer than the configured" +
"max.poll.interval.ms, exceeded by %s ms. This typically implies that the " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{}?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeap, my bad, I had found it too so it's fixed in a commit above

@@ -255,11 +257,15 @@ public long maximumTimeToWait(long currentTimeMs) {
* member to {@link MemberState#JOINING}, so that it rejoins the group.
*/
public void resetPollTimer(final long pollMs) {
pollTimer.update(pollMs);
if (pollTimer.isExpired()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have a method added to Timer such as long hasExpiredBy() so the check for expiration and the calculation of by how much is encapsulated in the timer itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, makes total sense, so moved the calculation to the timer, with an isExpiredBy. Small twist to what I understand you were suggesting, I kept the isExpired check, just to avoid having to deal with the logic of deducing if the timer is expired based on the isExpiredBy on the HBManager. Seems better to let the timer know the semantics of when it's considered expired (it does consider >= for instance, so just avoiding to bring those semantics into the HBManager). Makes sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense. When I was reviewing the previous iteration, I found myself looking within the Timer at the internal variables and then trying to figure out whether the derivation being performed was valid. Makes sense to do it within the Timer. Perfectly happy with 2 methods like this.

@lianetm
Copy link
Contributor Author

lianetm commented May 10, 2024

Just to clarify what we're getting here, related to @AndrewJSchofield 's very valid point. With this we get the time between internal poll events, which do not translate exactly to calls to consumer.poll depending on the situation. So the log here will be very helpful to tune the config in cases where the delay that led to leaving the group was due to the client app taking too long to process messages after a call to poll for example. It would be less accurate in cases where the delay is due to the fetch not getting messages, since we internally generate more poll events while at it.

@@ -255,11 +257,15 @@ public long maximumTimeToWait(long currentTimeMs) {
* member to {@link MemberState#JOINING}, so that it rejoins the group.
*/
public void resetPollTimer(final long pollMs) {
pollTimer.update(pollMs);
if (pollTimer.isExpired()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense. When I was reviewing the previous iteration, I found myself looking within the Timer at the internal variables and then trying to figure out whether the derivation being performed was valid. Makes sense to do it within the Timer. Perfectly happy with 2 methods like this.

Copy link
Contributor

@ableegoldman ableegoldman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit about the logging, but overall this now looks good to me!

@@ -193,11 +193,8 @@ public NetworkClientDelegate.PollResult poll(long currentTimeMs) {
}
pollTimer.update(currentTimeMs);
if (pollTimer.isExpired() && !membershipManager.isLeavingGroup()) {
logger.warn("Consumer poll timeout has expired. This means the time between " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you actually leave this log untouched? On the one hand I kind of agree with this simplification, and logs are by no means a part of the public contract, but I know for a fact that some people have built observation tools and/or dashboards for things like rebalancing issues by searching for relevant log strings such as this one (I know because I built one myself a long time ago)

I don't feel super strongly about this so I won't push back if you'd prefer to clean it up, but imo it doesn't hurt to leave the log here as well

Also: in some extreme cases, eg an infinite loop in a user's processing logic, the consumer might never return to call poll at all. In less extreme cases, eg some kind of long processing that takes on the order of minutes per record, it might be a very very long time before the consumer gets back to poll and logs the message you added. For the latter case, I think it would be valuable to keep this part about increasing the max.poll.interval or lowering the max.poll.records in the message we log here, when the max poll interval is first missed, so that users know what to do immediately and don't have to wait until they actually get through all 1000 records (or whatever max.poll.records is set to) and finally return to poll to see a hint about which configs to change

Copy link
Contributor Author

@lianetm lianetm May 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I did like the simplified log but totally agree with your points, both. I've been myself pushing for avoiding changing the existing logs content when possible because I've also heard about customers basing their apps on them. Also agree about the more complete output on the case of not hitting the next poll in a sensible time.

So left the log here unchanged (and simplified the other just to not repeat ourselves on the 2 logs). So in the common case that we end up with the 2 log lines, it's just a first one about the situation when it happens, and the 2nd one with the approximate exceeded time when we have the most accurate info. Makes sense?

Copy link
Contributor

@ableegoldman ableegoldman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. Thanks! LGTM

@lianetm
Copy link
Contributor Author

lianetm commented May 10, 2024

Thanks all for the helpful feedback! Let's wait for the build and we should be good @mjsax

@lianetm
Copy link
Contributor Author

lianetm commented May 13, 2024

Build completed with 12 unrelated test failures.

Copy link
Contributor

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @lianetm!

I think I can use the new Timer method in a couple of other logging output, too, so 👍

Copy link
Contributor

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @lianetm!

I think I can use the new Timer method in a couple of other logging output, too, so 👍

@ableegoldman ableegoldman merged commit e18f61c into apache:trunk May 14, 2024
1 check failed
@ableegoldman
Copy link
Contributor

Merged to trunk

thanks @lianetm !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants