coop sticky algo on large partition number #4629
Comments
Any thoughts on this problem? Are more details needed? |
@ericwuseattle could you send some logs with |
Unfortunately I don't have a test environment set up right now. Have you checked the hard-coded partition count in the code? If we could fix that part first, I'll find time to retry it. /* FIXME: Let the cgrp pass the actual eligible partition count */ https://github.com/confluentinc/librdkafka/blob/master/src/rdkafka_sticky_assignor.c#L1834 |
Given your configuration you're not using the sticky assignor as the default |
That is just the estimated partition count used for the initial size of maps and lists. You can try increasing the multiplier to see if it changes anything, and send some logs from the leader and 2-3 random members. |
Sorry, did not give you all the config, but we did have the setting for |
@ericwuseattle thanks, other helpful info is:
|
Description
I noticed two issues with Kafka cooperative-sticky mode.
1. The hard-coded partition_cnt inside rd_kafka_sticky_assignor_assign_cb:
https://github.com/confluentinc/librdkafka/blob/master/src/rdkafka_sticky_assignor.c#L1834
2. With 3K partitions it works without issue, but if I increase the partition count to 6K on a fresh topic (that is, the topic is recreated from scratch), I have to raise session.timeout.ms and max.poll.interval.ms from 3s to 10s (session.timeout.ms=10000, max.poll.interval.ms=10000) to make it work.
Otherwise the consumer gets kicked out of the group.
Broker logs:
Member XXX-6F958DDF5F-CDXRQ~-0793c679-d5ef-4753-9056-7da314e1415b in group XXX-TOPIC-NAME-XXX has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator).
I'm not sure what's causing the timeout, but I'm sure we keep calling Kafka poll from a timer indefinitely. After increasing the timeouts to 10s, it works without any issue.
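One way to double-check the claim about polling is to record a timestamp on every poll call and look at the largest gap between consecutive calls. The sketch below is a hypothetical helper, not part of librdkafka or any client library; the function names and the idea of collecting timestamps with `time.monotonic()` are assumptions made for illustration.

```python
def max_poll_gap_ms(poll_times_s):
    """Largest gap, in milliseconds, between consecutive poll timestamps.

    poll_times_s is a list of timestamps in seconds, in call order
    (for example, values collected from time.monotonic()).
    """
    gaps = [later - earlier for earlier, later in zip(poll_times_s, poll_times_s[1:])]
    return max(gaps, default=0.0) * 1000.0


def exceeds_poll_interval(poll_times_s, max_poll_interval_ms):
    """True if any gap between polls is longer than max.poll.interval.ms."""
    return max_poll_gap_ms(poll_times_s) > max_poll_interval_ms
```

If the largest gap stays well below max.poll.interval.ms while the member is still removed from the group, the application's poll loop is less likely to be the cause.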
Trying further with 15K partitions and a 10s timeout had no luck: it does not work, and the consumer gets kicked out of the group.
Overall:
3K partitions, 3s timeout: works.
6K partitions, 3s timeout: does not work.
6K partitions, 10s timeout: works.
15K partitions, 10s timeout: does not work.
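The four data points above can be compared by dividing the partition count by the timeout. This is only a rough sketch: it assumes the time needed grows roughly linearly with the partition count, which nothing in this issue establishes, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    partitions: int
    timeout_s: int
    worked: bool

    @property
    def partitions_per_timeout_second(self) -> float:
        return self.partitions / self.timeout_s


# The four results reported in this issue.
OBSERVATIONS = [
    Observation(3_000, 3, True),
    Observation(6_000, 3, False),
    Observation(6_000, 10, True),
    Observation(15_000, 10, False),
]


def threshold_bounds(observations):
    """Return (highest ratio that worked, lowest ratio that failed)."""
    working = max(o.partitions_per_timeout_second for o in observations if o.worked)
    failing = min(o.partitions_per_timeout_second for o in observations if not o.worked)
    return working, failing
```

Under that linear assumption, `threshold_bounds(OBSERVATIONS)` puts the break point somewhere between 1000 and 1500 partitions per second of timeout. Four data points are too few to say more than that.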
How to reproduce
Create a fresh topic with a large number of partitions and consume it with the cooperative-sticky strategy:
6K partitions with a 3s session timeout
or
15K partitions with a 10s session timeout
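The two failing cases can be written down as property dictionaries built from the client configuration listed in the checklist. This is a sketch only: `BASE_CONFIG`, `with_timeout`, and `FAILING_CASES` are illustrative names, and connection settings such as the broker list and group id are left out because the issue does not give them.

```python
# Client configuration as quoted in the checklist of this issue (values as strings).
BASE_CONFIG = {
    "fetch.min.bytes": "1",
    "fetch.wait.max.ms": "500",
    "fetch.error.backoff.ms": "0",
    "heartbeat.interval.ms": "1000",
    "enable.auto.commit": "false",
    "enable.partition.eof": "false",
    "enable.auto.offset.store": "false",
    "max.poll.interval.ms": "3000",
    "session.timeout.ms": "3000",
    "partition.assignment.strategy": "cooperative-sticky",
}


def with_timeout(config, timeout_ms):
    """Return a copy of config with both timeouts set to timeout_ms."""
    updated = dict(config)
    updated["session.timeout.ms"] = str(timeout_ms)
    updated["max.poll.interval.ms"] = str(timeout_ms)
    return updated


# (partition count, configuration) pairs for the two failing cases.
FAILING_CASES = [
    (6_000, with_timeout(BASE_CONFIG, 3_000)),
    (15_000, with_timeout(BASE_CONFIG, 10_000)),
]
```

Each dictionary would still need the missing connection properties added before being handed to a librdkafka-based client.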
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka version (release number or git tag): <2.3.0>
Apache Kafka version: <3.0>
librdkafka client configuration: <fetch.min.bytes=1, fetch.wait.max.ms=500, fetch.error.backoff.ms=0, heartbeat.interval.ms=1000, enable.auto.commit=false, enable.partition.eof=false, enable.auto.offset.store=false, max.poll.interval.ms=3000, session.timeout.ms=3000, partition.assignment.strategy=cooperative-sticky>
Operating system: <Ubuntu (x64)>
Provide logs (with debug=.. as necessary) from librdkafka