Invalid negative sequence number used #898

Open
RagingPuppies opened this issue Mar 11, 2022 · 0 comments
Subject of the issue

On high-load Brooklin clusters, once flushless mode is enabled I get tons of these errors.
Producers get killed, but some partitions keep growing lag until I restart the service.

Your environment

  • Ubuntu 18.04
  • Brooklin version 1.1.0
  • Java version 1.8.0_152
  • Kafka version 2.5.2
  • ZooKeeper version 3.4

Steps to reproduce

My default producer settings:

default['brooklin_v2']['default_provider_properties']['buffer.memory'] = '61600000'
default['brooklin_v2']['default_provider_properties']['batch.size'] = '80000'
default['brooklin_v2']['default_provider_properties']['linger.ms'] = '15000'
default['brooklin_v2']['default_provider_properties']['request.timeout.ms'] = '300000'
default['brooklin_v2']['default_provider_properties']['compression.type'] = 'gzip'
default['brooklin_v2']['default_provider_properties']['producersPerTask'] = '3'
default['brooklin_v2']['default_provider_properties']['numProducersPerConnector'] = '15'
default['brooklin_v2']['default_provider_properties']['producerRateLimiter'] = '0.05'
default['brooklin_v2']['default_provider_properties']['retries'] = '100'
default['brooklin_v2']['default_provider_properties']['retry.backoff.ms'] = '300'
default['brooklin_v2']['default_provider_properties']['enable.idempotence'] = 'true'
default['brooklin_v2']['default_provider_properties']['acks'] = 'all'
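Since `enable.idempotence` is on above, each producer gets a PID and a per-partition sequence number, and the broker rejects any batch whose sequence does not follow the last accepted one. The following is a minimal illustrative model of that bookkeeping (not Kafka client or broker source code) showing how the OUT_OF_ORDER_SEQUENCE_NUMBER / invalid sequence error class in this issue arises:

```python
# Illustrative model of idempotent-producer sequencing (not Kafka source):
# each (PID, partition) pair carries a monotonically increasing sequence
# number; the broker rejects a batch whose sequence does not follow the
# last one it accepted.

class Broker:
    def __init__(self):
        self.last_seq = {}  # (pid, partition) -> last accepted sequence

    def append(self, pid, partition, seq):
        expected = self.last_seq.get((pid, partition), -1) + 1
        if seq != expected:
            # In real Kafka this surfaces as OUT_OF_ORDER_SEQUENCE_NUMBER
            raise RuntimeError(
                f"OUT_OF_ORDER_SEQUENCE_NUMBER: got {seq}, expected {expected}")
        self.last_seq[(pid, partition)] = seq

class IdempotentProducer:
    def __init__(self, pid, broker):
        self.pid = pid
        self.broker = broker
        self.next_seq = {}  # partition -> next sequence to send

    def send(self, partition):
        seq = self.next_seq.get(partition, 0)
        self.broker.append(self.pid, partition, seq)
        self.next_seq[partition] = seq + 1

broker = Broker()
p = IdempotentProducer(pid=1, broker=broker)
p.send(0)
p.send(0)            # sequences 0 and 1 accepted
p.next_seq[0] = 5    # simulate producer/broker sequence state diverging
try:
    p.send(0)        # broker expected 2, got 5 -> fatal for this producer
except RuntimeError as e:
    print(e)
```

When this error is fatal, the client is expected to be torn down and replaced by a new producer with a fresh PID, which is the recovery path described in the Expected behaviour section below.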

My default consumer settings:

default['brooklin_v2']['default_connector_properties']['consumer.partition.assignment.strategy'] = 'org.apache.kafka.clients.consumer.RoundRobinAssignor'
default['brooklin_v2']['default_connector_properties']['factoryClassName'] = 'com.linkedin.datastream.connectors.kafka.mirrormaker.KafkaMirrorMakerConnectorFactory'
default['brooklin_v2']['default_connector_properties']['assignmentStrategyFactory'] = 'com.linkedin.datastream.server.assignment.BroadcastStrategyFactory'
default['brooklin_v2']['default_connector_properties']['consumer.receive.buffer.bytes'] = '4096000'
default['brooklin_v2']['default_connector_properties']['consumer.session.timeout.ms'] = '120000'
default['brooklin_v2']['default_connector_properties']['consumer.heartbeat.interval.ms'] = '9000'
default['brooklin_v2']['default_connector_properties']['consumer.request.timeout.ms'] = '60000'
default['brooklin_v2']['default_connector_properties']['consumer.auto.offset.reset'] = 'latest'
default['brooklin_v2']['default_connector_properties']['commitIntervalMs'] = '30000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.min.bytes'] = '500000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.max.bytes'] = '4000000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.max.wait.ms'] = '250'
default['brooklin_v2']['default_connector_properties']['pausePartitionOnError'] = 'true'
default['brooklin_v2']['default_connector_properties']['pauseErrorPartitionDurationMs'] = '180000'
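The last two settings mean a partition that hits a send error is paused and should be auto-resumed once `pauseErrorPartitionDurationMs` (180 s here) elapses. A minimal sketch of that pause/auto-resume cycle, with hypothetical names (this is not Brooklin's internal API):

```python
# Sketch of pausePartitionOnError / pauseErrorPartitionDurationMs behaviour:
# a partition that errors is paused, and each poll loop resumes any
# partition whose pause window has elapsed. Names are illustrative.

import time

PAUSE_DURATION_MS = 180_000  # matches pauseErrorPartitionDurationMs above

class PausedPartitions:
    def __init__(self, now_ms=lambda: time.time() * 1000):
        self.now_ms = now_ms
        self.paused = {}  # partition -> timestamp (ms) when it was paused

    def pause(self, partition):
        self.paused[partition] = self.now_ms()

    def resume_expired(self):
        # Called on every poll loop: drop partitions whose pause window has
        # elapsed and return them so the consumer can resume fetching.
        now = self.now_ms()
        expired = [p for p, t in self.paused.items()
                   if now - t >= PAUSE_DURATION_MS]
        for p in expired:
            del self.paused[p]
        return expired

clock = [0]  # fake clock so the example runs instantly
pp = PausedPartitions(now_ms=lambda: clock[0])
pp.pause("topic-7")
clock[0] = 200_000
print(pp.resume_expired())  # -> ['topic-7']
```

If a paused partition ever fell out of this kind of tracking without being resumed, it would silently stop being consumed with no error logged, which is consistent with the "forgotten partition" symptom described below, though I cannot confirm that is the cause.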

Other than that, I've added "isFlushlessModeEnabled": "true" to my connector settings.
My Brooklin cluster is processing around 600 MBps across 3 datastreams, with 450 consumers and around 1000 producers.
I have 3 more clusters suffering from the same issue, which started about 2 weeks after the implementation.
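For context on what flushless mode changes: instead of flushing the producer before every checkpoint, the connector tracks in-flight send offsets per partition and only commits up to the smallest offset that has not yet been acknowledged. The sketch below illustrates that idea under my own assumptions; it is not Brooklin's actual implementation:

```python
# Rough sketch of flushless checkpointing: track offsets that were sent but
# not yet acked, and compute the highest offset that is safe to commit
# (everything below the oldest still-in-flight message). Illustrative only.

import heapq

class FlushlessTracker:
    def __init__(self):
        self.in_flight = []  # min-heap of offsets sent but not yet acked
        self.acked = set()   # offsets acked out of order

    def record_send(self, offset):
        heapq.heappush(self.in_flight, offset)

    def record_ack(self, offset):
        self.acked.add(offset)

    def safe_commit_offset(self):
        # Pop every leading offset that is already acked; what remains at
        # the top of the heap is the oldest in-flight message, so it is
        # only safe to commit up to the offset just before it.
        while self.in_flight and self.in_flight[0] in self.acked:
            self.acked.discard(heapq.heappop(self.in_flight))
        if self.in_flight:
            return self.in_flight[0] - 1
        return None  # everything acked; caller may commit the last send

t = FlushlessTracker()
for o in (10, 11, 12):
    t.record_send(o)
t.record_ack(10)
t.record_ack(12)               # 11 is still in flight
print(t.safe_commit_offset())  # -> 10: committing past 10 would skip 11
```

The relevance to this issue: if acks for one partition stop arriving (e.g. after a producer dies mid-flight), its safe commit offset stops advancing even though no error is surfaced, which would look exactly like a partition whose lag silently grows.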

Expected behaviour

A producer hitting OUT_OF_ORDER_SEQUENCE_NUMBER should retry and eventually die; a new producer should then be raised with a new PID, and all partitions should heal once the new producers spawn.

Actual behaviour

The new producers do come up and most partitions recover, but 1 to a few partitions never heal. There are no errors related to these partitions; Brooklin just seems to "forget" about them. In the screenshot below, most partitions resume consumption after the issue occurs, but one partition only resumed after a service restart. Strangely, Brooklin's own lag metric does not even show lag for it.

Screen Shot 2022-03-11 at 18 37 59
