
rm_stm: couple of stability fixes noticed when down scaling max_concurrent_producer_ids #18573

Merged

Conversation

bharathv
Contributor

@bharathv bharathv commented May 18, 2024

On a cluster with a large pile-up of historical producer_ids, suddenly downscaling the max_concurrent_producer_ids configuration resulted in crashes.

  • The eviction logic spins up too many concurrent tasks at the same scheduling point (> 1M in this case), resulting in a 16MB allocation in the reactor path.
  • Loading too many snapshots at once (during bootstrap) was causing a big spike in memory usage; we can save some memory by aggressively releasing snapshot structures once their contents have been loaded into the desired destination.

TBD: Link GH issues.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

@bharathv
Contributor Author

/dt

@travisdowns
Member

Seems good. Isn't part of the problem that the evict loop here is not async, so the tasks just accumulate without getting any chance to run?

@travisdowns
Member

How often does the eviction tick? With only 100 per tick, is it possible that the PIDs can now grow faster than we evict them?

Instead of spawning one task per PID, it seems better to just collect all the PIDs we are going to evict and spawn off a single task to do the eviction. This would also solve the specific allocation failure.
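
A minimal sketch of what this suggestion could look like (the names producer_state, evict_batch, and eviction_tick are hypothetical stand-ins, not the PR's code): collect the candidates in one synchronous pass, then evict the whole batch from a single fiber.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

#include <memory>
#include <tuple>
#include <vector>

// Stand-in for the real producer state; only what the sketch needs.
struct producer_state {
    seastar::future<> evict() { return seastar::make_ready_future<>(); }
};
using producer_ptr = std::shared_ptr<producer_state>;

// Evict the whole batch sequentially inside one fiber.
seastar::future<> evict_batch(std::vector<producer_ptr> victims) {
    for (auto& p : victims) {
        co_await p->evict();
    }
}

void eviction_tick(std::vector<producer_ptr> candidates) {
    // One detached task for the batch (real code would guard it with a gate)
    // instead of one task per PID queued at a single scheduling point.
    std::ignore = evict_batch(std::move(candidates));
}
```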

@bharathv
Contributor Author

> Isn't part of the problem that the evict loop here is not async, so the tasks just accumulate without getting any chance to run?

Ya right, the original intention was to avoid a scheduling point during the eviction tick so that the iterators in the list being traversed are not invalidated. I have another idea that gets rid of the async evict function; it seems to be working locally, and I'll clean it up and push it shortly.

> How often does the eviction tick? With only 100 per tick, is it possible that the PIDs can now grow faster than we evict them?
> Instead of spawning one task per PID, it seems better to just collect all the PIDs we are going to evict and spawn off a single task to do the eviction. This would also solve the specific allocation failure.

It runs every 5s; ya, probably 100 per tick is too small. I think with a non-futurized evict implementation (next PS), we can evict more in one go.

When a lot of partitions start up on the shard at the same time, we
noticed crashes in this part of the code when the snapshot sizes are
non-trivial (large # of producers in the snapshot). This patch releases
already-applied snapshot state to ease the memory pressure a bit.
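
As a rough illustration of the commit message above (the types producers_snapshot, producer_entry, and apply_one are invented stand-ins, not the real rm_stm snapshot code), the idea is to free each entry's storage as soon as it has been applied, rather than keeping the whole decoded snapshot alive until bootstrap finishes:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical per-producer snapshot entry; the vector of strings stands in
// for the bulky per-producer state kept in a real snapshot.
struct producer_entry {
    long producer_id{};
    std::vector<std::string> state;
};

struct producers_snapshot {
    std::vector<producer_entry> producers;
};

// Stand-in for moving one entry into the live, post-apply structures.
void apply_one(producer_entry entry) { /* details omitted */ }

void apply_snapshot(producers_snapshot snap) {
    while (!snap.producers.empty()) {
        // Move each entry out and drop its slot right away, so the bulky
        // per-producer state is released as soon as it is applied rather
        // than held until every partition on the shard has bootstrapped.
        producer_entry entry = std::move(snap.producers.back());
        snap.producers.pop_back();
        apply_one(std::move(entry));
    }
    snap.producers.shrink_to_fit(); // return the now-empty buffer as well
}
```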
@bharathv
Contributor Author

/ci-repeat 5

@bharathv
Contributor Author

/dt

@bharathv bharathv added this to the v23.3.16 milestone May 21, 2024
@bharathv
Contributor Author

/ci-repeat 5

@bharathv bharathv marked this pull request as ready for review May 21, 2024 22:34
@piyushredpanda
Contributor

piyushredpanda commented May 22, 2024

Known failure: #12897

This test ensures concurrent evictions can happen in the presence of
replication operations and operations that reset the state (snapshots,
partition stop).
Prior to this commit, producer_state::evict() was asynchronous because
it waited on a gate to drain any pending tasks on the producer state.
This resulted in a fiber per evict() call in the producer_state_manager
and hence a large number of fibers when evicting a ton of producers in
one eviction tick (which manifested as a large allocation).

This commit fixes the issue by making evict() synchronous and getting
rid of the gate that was forcing it. That is ok because the eviction
tick runs synchronously (without scheduling points) and only considers
candidates that are still linked to the shard-wide list. Any caller
function using the producer state either waits on op_lock or detaches
from the list before running the operation, ensuring that it is not
considered for eviction.
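
A minimal sketch of the approach the commit describes (member names such as idle, _producers, and evict_tick are assumptions for illustration, not the actual rm_stm code): evict() is synchronous and the tick never yields, so list iterators remain valid for the entire scan and no per-producer fiber is spawned.

```cpp
#include <cstddef>
#include <list>

// Stand-in for the real producer state; the actual code keeps producers on
// an intrusive, shard-wide list.
struct producer_state {
    // Callers either hold op_lock or unlink the producer before running an
    // operation, so anything still linked and idle is safe to evict.
    bool idle{true};
    void evict() { /* tear down state synchronously: no gate, no co_await */ }
};

struct producer_state_manager {
    std::list<producer_state> _producers;

    // The tick never yields, so iterators into the list stay valid and at
    // most max_to_evict producers are torn down per tick.
    void evict_tick(std::size_t max_to_evict) {
        std::size_t evicted = 0;
        for (auto it = _producers.begin();
             it != _producers.end() && evicted < max_to_evict;) {
            if (it->idle) {
                it->evict();
                it = _producers.erase(it);
                ++evicted;
            } else {
                ++it;
            }
        }
    }
};
```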
@travisdowns
Member

Were you able to reproduce the crash with the new tests you added?

@piyushredpanda piyushredpanda modified the milestones: v23.3.16, v23.3.x-next May 22, 2024
@bharathv
Contributor Author

> Were you able to reproduce the crash with the new tests you added?

The new tests in this PR mainly validate the correctness and memory safety of eviction, not the OOM issue we ran into. I'm not able to reproduce the exact OOM that the user hit by spawning too many tasks at once; perhaps it needs a lot of background noise in the test to ensure fragmentation. I'm pushing a small test that ensures only a limited number of producers are evicted in each tick, and that, combined with the fact that there is no longer a task per eviction, probably addresses the issue.

@piyushredpanda piyushredpanda merged commit 165b952 into redpanda-data:v23.3.x May 23, 2024
16 checks passed