KAFKA-16695: Improve expired poll logging #15909

lianetm · 2024-05-09T15:51:28Z

Improve the consumer log for an expired poll timer by showing by how much max.poll.interval.ms was exceeded. This should help guide users in tuning that config in the common case of long-running processing causing the consumer to leave the group. Inspired by other clients that log such information in the same situation.

lianetm · 2024-05-09T15:52:28Z

Hey @mjsax, here is the improved logging following your suggestion; I expect it will indeed be helpful. Would you have a chance to take a look? Thanks!

mjsax

Thanks! -- Need to wait for Jenkins to pass before merging. LGTM.

ableegoldman

Love the idea of this, but wouldn't it be more useful to log the amount of time until the consumer polls next, not until the heartbeat thread polls?

ableegoldman · 2024-05-09T20:36:21Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/HeartbeatRequestManager.java

@@ -193,11 +193,12 @@ public NetworkClientDelegate.PollResult poll(long currentTimeMs) {
        }
        pollTimer.update(currentTimeMs);
        if (pollTimer.isExpired() && !membershipManager.isLeavingGroup()) {
-            logger.warn("Consumer poll timeout has expired. This means the time between " +
+            logger.warn("Consumer poll timeout has expired, exceeded by {} ms. This means the time between " +


IIUC this is what gets logged when the heartbeat thread notices the consumer has failed to poll in time and dropped out of the group -- so the "time exceeded" is just going to be roughly the max poll interval + the heartbeat interval, no?

I do think it's a great idea to log the amount of time by which the max poll interval was exceeded, but imo the more useful information is how long after the max poll interval the consumer took to actually hit poll again, not how long the heartbeat thread took to notice it.

I see what you mean. I think that the background thread will notice more quickly than you said, but this just means the "time exceeded" is going to be very close to max poll interval. The heartbeat request manager checks to see whether it is time to send a heartbeat more regularly than it actually sends a heartbeat.

Maybe enhancing the logging in HeartbeatRequestManager.resetPollTimer would be a suitable point. This is where the heartbeat request manager will notice that it has already left the group because of delinquent polling, and rejoins when the next poll occurs. @lianetm that's probably workable I think.

Hey, good point. That takes this a step further and should indeed be more useful. As @AndrewJSchofield pointed out, the HB manager will notice sooner in practice (even sooner than the HB interval), but we do know when the next poll happens, so we can definitely get a more accurate exceeded time (in between calls to poll, which translate to poll events handled in this same manager). On it... thanks for the comments!

lianetm · 2024-05-10T14:14:48Z

Done, so I simplified what we log when the background thread realizes time's up and leaves the group to rejoin eventually (that's all the relevant info at that point). I then moved the log that details the expired max.poll.interval to the place where we can give a more accurate exceeded time, which is on the next app poll event that the background handles. Also updated the test to make sure it checks not only how the exceed time is calculated, but also where it is calculated. Makes sense? More accurate now indeed, thanks!
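To make the distinction in this thread concrete, here is a minimal sketch of the two observation points. The class and method names are hypothetical stand-ins, not the actual HeartbeatRequestManager code: the background check can only tell that the deadline has passed, while the next application poll event is where the full overrun is known.

```java
// Hypothetical stand-in for the poll-timer bookkeeping discussed above; not Kafka's actual code.
final class PollDeadlineSketch {
    private final long maxPollIntervalMs;
    private long deadlineMs;

    PollDeadlineSketch(long startMs, long maxPollIntervalMs) {
        this.maxPollIntervalMs = maxPollIntervalMs;
        this.deadlineMs = startMs + maxPollIntervalMs;
    }

    // Background thread: it can only tell that the deadline has passed, shortly after it did.
    boolean expiredAt(long nowMs) {
        return nowMs >= deadlineMs;
    }

    // Next application poll event: the full overrun is known here, so this is where the
    // "exceeded by N ms" figure is computed before the deadline is restarted.
    long onApplicationPoll(long nowMs) {
        long exceededMs = Math.max(0L, nowMs - deadlineMs);
        deadlineMs = nowMs + maxPollIntervalMs;
        return exceededMs;
    }

    public static void main(String[] args) {
        PollDeadlineSketch sketch = new PollDeadlineSketch(0L, 300_000L);
        // The background check notices the expiry 50 ms after the deadline.
        System.out.println(sketch.expiredAt(300_050L));         // prints true
        // The application only polls again two minutes after the deadline.
        System.out.println(sketch.onApplicationPoll(420_000L)); // prints 120000
    }
}
```

In this sketch, a value computed at the background check would be about 50 ms, whereas the value computed on the next poll event (120000 ms) reflects how long the application actually took.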

AndrewJSchofield

The updated PR is definitely better.

Because we have the application thread and the background thread running concurrently, and the application thread waiting in a long poll(Duration) actually polls internally, what we are measuring here is the time since the last of these internal polls, which will be approximately the end of the application's latest call to poll(Duration). I think that's going to be good enough for this purpose, helping the user understand whether they need to increase max.poll.interval.ms.

AndrewJSchofield · 2024-05-10T14:54:24Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/HeartbeatRequestManager.java

        if (pollTimer.isExpired()) {
-            logger.debug("Poll timer has been reset after it had expired");
+            logger.warn("Time between subsequent calls to poll() was longer than the configured" +
+                "max.poll.interval.ms, exceeded by %s ms. This typically implies that the " +


Yep, my bad. I had found it too, so it's fixed in a commit above.

AndrewJSchofield · 2024-05-10T15:44:17Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/HeartbeatRequestManager.java

@@ -255,11 +257,15 @@ public long maximumTimeToWait(long currentTimeMs) {
     * member to {@link MemberState#JOINING}, so that it rejoins the group.
     */
    public void resetPollTimer(final long pollMs) {
+        pollTimer.update(pollMs);
        if (pollTimer.isExpired()) {


I would rather have a method added to Timer such as long hasExpiredBy() so the check for expiration and the calculation of by how much is encapsulated in the timer itself.

Agreed, makes total sense, so I moved the calculation to the timer, with an isExpiredBy. Small twist on what I understand you were suggesting: I kept the isExpired check, just to avoid having to deduce in the HBManager whether the timer is expired based on isExpiredBy. It seems better to let the timer own the semantics of when it's considered expired (it uses >= for instance), rather than bringing those semantics into the HBManager. Makes sense?

Yes, makes sense. When I was reviewing the previous iteration, I found myself looking within the Timer at the internal variables and then trying to figure out whether the derivation being performed was valid. Makes sense to do it within the Timer. Perfectly happy with 2 methods like this.
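A minimal sketch of the two-method shape agreed on here, assuming a simplified timer with a fixed deadline. `SimpleTimer` and `SimpleTimerDemo` are hypothetical names for illustration; Kafka's own Timer class carries more state and methods than shown.

```java
// Simplified, hypothetical timer; Kafka's own Timer class has more state and methods.
final class SimpleTimer {
    private final long deadlineMs;
    private long currentTimeMs;

    SimpleTimer(long startMs, long timeoutMs) {
        this.currentTimeMs = startMs;
        this.deadlineMs = startMs + timeoutMs;
    }

    // Advance the timer's notion of "now"; time never moves backwards.
    void update(long nowMs) {
        currentTimeMs = Math.max(currentTimeMs, nowMs);
    }

    // The timer owns the expiry rule, including the >= boundary.
    boolean isExpired() {
        return currentTimeMs >= deadlineMs;
    }

    // Milliseconds past the deadline, or 0 if the deadline has not been passed.
    long isExpiredBy() {
        return Math.max(0L, currentTimeMs - deadlineMs);
    }
}

final class SimpleTimerDemo {
    // Caller-side pattern from the discussion: branch on isExpired(), report isExpiredBy().
    static String describe(SimpleTimer timer) {
        return timer.isExpired()
            ? "expired, exceeded by " + timer.isExpiredBy() + " ms"
            : "not expired";
    }
}
```

Exactly at the deadline, `isExpired()` is true while `isExpiredBy()` is 0, so a caller inferring expiry from the elapsed value alone would need to know the boundary rule. Keeping both methods leaves that rule inside the timer.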

lianetm · 2024-05-10T17:16:10Z

Just to clarify what we're getting here, related to @AndrewJSchofield 's very valid point. With this we get the time between internal poll events, which do not translate exactly to calls to consumer.poll depending on the situation. So the log here will be very helpful to tune the config in cases where the delay that led to leaving the group was due to the client app taking too long to process messages after a call to poll for example. It would be less accurate in cases where the delay is due to the fetch not getting messages, since we internally generate more poll events while at it.

ableegoldman

One nit about the logging, but overall this now looks good to me!

ableegoldman · 2024-05-10T20:29:52Z

clients/src/main/java/org/apache/kafka/clients/consumer/internals/HeartbeatRequestManager.java

@@ -193,11 +193,8 @@ public NetworkClientDelegate.PollResult poll(long currentTimeMs) {
        }
        pollTimer.update(currentTimeMs);
        if (pollTimer.isExpired() && !membershipManager.isLeavingGroup()) {
-            logger.warn("Consumer poll timeout has expired. This means the time between " +


Can you actually leave this log untouched? On the one hand I kind of agree with this simplification, and logs are by no means a part of the public contract, but I know for a fact that some people have built observation tools and/or dashboards for things like rebalancing issues by searching for relevant log strings such as this one (I know because I built one myself a long time ago)

I don't feel super strongly about this so I won't push back if you'd prefer to clean it up, but imo it doesn't hurt to leave the log here as well

Also: in some extreme cases, eg an infinite loop in a user's processing logic, the consumer might never return to call poll at all. In less extreme cases, eg some kind of long processing that takes on the order of minutes per record, it might be a very very long time before the consumer gets back to poll and logs the message you added. For the latter case, I think it would be valuable to keep this part about increasing the max.poll.interval or lowering the max.poll.records in the message we log here, when the max poll interval is first missed, so that users know what to do immediately and don't have to wait until they actually get through all 1000 records (or whatever max.poll.records is set to) and finally return to poll to see a hint about which configs to change

Done. I did like the simplified log, but I totally agree with both of your points. I've been pushing myself to avoid changing the content of existing logs when possible, because I've also heard about customers basing their apps on them. I also agree about the more complete output for the case of not hitting the next poll in a sensible time.

So I left the log here unchanged (and simplified the other one so that we don't repeat ourselves across the 2 logs). In the common case where we end up with both log lines, the first reports the situation when it happens, and the second gives the approximate exceeded time once we have the most accurate info. Makes sense?
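For readers acting on the hint in these log messages, here is a small sketch of overriding the two configs named above. The config keys are the real consumer settings; the helper names, the headroom heuristic, and the values used with it are assumptions for illustration only, not a recommendation from this PR.

```java
import java.util.Properties;

final class PollTuningSketch {
    // Hypothetical heuristic: new interval = old interval + logged overrun + some headroom.
    static long suggestedMaxPollIntervalMs(long currentMs, long exceededMs, long headroomMs) {
        return currentMs + exceededMs + headroomMs;
    }

    // Builds consumer overrides: a longer poll interval and/or smaller batches per poll.
    static Properties consumerOverrides(long maxPollIntervalMs, int maxPollRecords) {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", Long.toString(maxPollIntervalMs));
        props.setProperty("max.poll.records", Integer.toString(maxPollRecords));
        return props;
    }
}
```

These properties would be merged into the rest of the consumer configuration (bootstrap servers, deserializers, group id), which is omitted here.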

ableegoldman

Awesome. Thanks! LGTM

lianetm · 2024-05-10T21:09:12Z

Thanks all for the helpful feedback! Let's wait for the build and we should be good @mjsax

lianetm · 2024-05-13T12:57:53Z

Build completed with 12 unrelated test failures.

kirktrue

LGTM! Thanks @lianetm!

I think I can use the new Timer method in a couple of other logging outputs, too, so 👍

ableegoldman · 2024-05-14T01:05:15Z

Merged to trunk

thanks @lianetm !

Log exceeded time & test

870b6ba

AndrewJSchofield approved these changes May 9, 2024

mjsax added the consumer label May 9, 2024

mjsax approved these changes May 9, 2024

ableegoldman requested changes May 9, 2024

lianetm added 2 commits May 10, 2024 10:05

More accurate exceeded time on resetPollTimer

a261894

extend test to check when the exceeded time is used

d6e14ad

lianetm requested review from ableegoldman and AndrewJSchofield May 10, 2024 14:15

typo in msg

abe3c5a

AndrewJSchofield reviewed May 10, 2024

lianetm added 2 commits May 10, 2024 12:54

calculation in timer

eff099a

update test

1ebdddd

AndrewJSchofield approved these changes May 10, 2024

ableegoldman reviewed May 10, 2024

lianetm added 2 commits May 10, 2024 16:39

Update log messages

6ce27fe

Add missing space in new log msg

5e9c046

ableegoldman approved these changes May 10, 2024

kirktrue reviewed May 13, 2024

kirktrue approved these changes May 13, 2024

ableegoldman merged commit e18f61c into apache:trunk May 14, 2024
1 check failed
