Implement timeline_manager for safekeeper background tasks #7768

Merged: 11 commits merged into main from sk-refactor-backup-launcher on May 22, 2024

Conversation

@petuhovskiy (Member) commented May 15, 2024

In safekeepers we have several background tasks. Previously the WAL backup task was spawned by another task called wal_backup_launcher. That task received notifications via wal_backup_launcher_rx and decided whether to spawn or kill the backup task associated with a timeline. This was inconvenient because every code path that touched shared state was responsible for pushing a notification into the wal_backup_launcher_tx channel. This was error prone: a notification is easy to miss, and pushing notifications in the wrong order could in some cases lead to a deadlock.

We had a similar issue with the is_active timeline flag. That flag was derived from the timeline state, so code modifying the state had to call a function to recompute the flag. We had a few bugs caused by forgetting to update is_active in places where it could change.

To fix these issues, this PR adds a new timeline_manager background task associated with each timeline. This task is responsible for managing all per-timeline background tasks, including the is_active flag, which is used for pushing broker messages. It subscribes to timeline state updates and, in a loop, decides when to spawn or kill background tasks.
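
For concreteness, a minimal sketch of that loop, assuming state snapshots arrive over a tokio watch channel; `StateSnapshot`, the task bodies, and the spawn/abort details are illustrative assumptions, not the PR's actual code:

```rust
use tokio::sync::watch;
use tokio::task::JoinHandle;

struct StateSnapshot {
    is_active: bool, // derived from the shared timeline state
}

async fn timeline_manager(mut state_rx: watch::Receiver<StateSnapshot>) {
    let mut backup_task: Option<JoinHandle<()>> = None;
    // The manager is the single place that reacts to state changes,
    // so no other code path needs to remember to spawn/kill tasks.
    while state_rx.changed().await.is_ok() {
        let is_active = state_rx.borrow().is_active;
        if is_active && backup_task.is_none() {
            // Timeline became active: start the WAL backup task.
            backup_task = Some(tokio::spawn(async { /* backup loop */ }));
        } else if !is_active {
            if let Some(task) = backup_task.take() {
                // Real code would cancel cooperatively and await completion.
                task.abort();
            }
        }
    }
    // Sender dropped: the timeline is being shut down.
}
```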

There is a new structure called TimelinesSet. It stores a set of Arc<Timeline> and allows copying the set so it can be iterated without holding the mutex. This is what replaces the is_active flag for the broker: the broker push task now holds a reference to a TimelinesSet of active timelines and uses it instead of iterating over all timelines and filtering by the is_active flag.
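
An illustrative sketch of the idea (field and method names here are assumptions, not the actual PR code):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct Timeline; // stand-in for the real safekeeper Timeline
type TimelineId = u64; // stand-in for the real id type

#[derive(Default)]
struct TimelinesSet {
    timelines: Mutex<HashMap<TimelineId, Arc<Timeline>>>,
}

impl TimelinesSet {
    /// Called by the manager when a timeline becomes (in)active.
    fn set_present(&self, id: TimelineId, tli: &Arc<Timeline>, present: bool) {
        let mut map = self.timelines.lock().unwrap();
        if present {
            map.insert(id, Arc::clone(tli));
        } else {
            map.remove(&id);
        }
    }

    /// Copy out the current members so the broker push task can iterate
    /// over active timelines without holding the mutex.
    fn snapshot(&self) -> Vec<Arc<Timeline>> {
        self.timelines.lock().unwrap().values().cloned().collect()
    }
}
```

Because snapshot clones the Arcs out under the lock and returns, the broker push task can take as long as it likes iterating the result without blocking writers.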

This PR also adds some metrics for manager iterations and for active backup tasks. Ideally the manager should not run many iterations, and we should not have many backup tasks spawned at the same time.
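
Hypothetical metric definitions in the spirit of what the PR adds; the metric names are made up, and this sketch uses the prometheus crate directly rather than neon's own metrics wrapper:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, register_int_gauge, IntCounter, IntGauge};

// Counts manager loop iterations; a high rate suggests spurious wakeups.
static MANAGER_ITERATIONS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "safekeeper_timeline_manager_iterations_total",
        "Number of timeline_manager loop iterations"
    )
    .expect("failed to register metric")
});

// Tracks concurrently running WAL backup tasks.
static ACTIVE_BACKUP_TASKS: Lazy<IntGauge> = Lazy::new(|| {
    register_int_gauge!(
        "safekeeper_wal_backup_tasks_active",
        "Number of currently running WAL backup tasks"
    )
    .expect("failed to register metric")
});
```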

Fixes #7751

github-actions bot commented May 15, 2024

3096 tests run: 2969 passed, 0 failed, 127 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_synthetic_size_while_deleting: debug
  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 31.3% (6414 of 20481 functions)
  • lines: 48.0% (49319 of 102667 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results; last updated for e030838 at 2024-05-21T23:14:08.491Z.

@arssher (Contributor) commented May 16, 2024

+1, I dislike these wal backup launcher notifications which must happen outside the lock, so I had been thinking about something similar.

The simplest thing which first came to my mind was to remove the launcher entirely and spin up / stop the task directly from whatever code updates the status (under lock). However, this is problematic because we don't want two tasks of the same type running at the same time, so stopping a task requires some await'ing, and that shouldn't happen under the global timeline lock due to the potential deadlock danger.

A separate manager task solves this because 1) notifying the manager doesn't need await, since a watch channel is used instead of a usual one, and 2) the manager can keep its state locally, or under a different lock than the timeline lock.
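
A minimal illustration of point (1), with hypothetical function names: sending on a tokio watch channel is synchronous and therefore safe while holding a std mutex, whereas sending on a bounded mpsc channel (the "usual" one) must be awaited:

```rust
use tokio::sync::{mpsc, watch};

// Safe under a lock: `watch::Sender::send` never awaits; it just replaces
// the stored value and wakes subscribers (the manager).
fn notify_manager(tx: &watch::Sender<u64>, seqno: u64) {
    let _ = tx.send(seqno); // Err only if all receivers are gone
}

// Not safe under a lock: a bounded mpsc send must be awaited and can
// suspend when the channel is full, which is where the old deadlock
// danger came from.
async fn notify_launcher(tx: &mpsc::Sender<u64>, seqno: u64) {
    let _ = tx.send(seqno).await;
}
```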

It should probably still publish task state in some shared memory so that outside code can wait for task termination (pause, probably freezing when the timeline is being evicted), but that's a different subject.

I also think we'd likely benefit from generalizing the task itself, so there is common code for cancelling and waiting for termination.

Four review threads (three on safekeeper/src/timeline.rs, one on safekeeper/src/timeline_manager.rs) were resolved.
petuhovskiy force-pushed the sk-refactor-backup-launcher branch 2 times, most recently from b3c4735 to 9bf2384, May 16, 2024 23:59
@petuhovskiy (Member, Author) commented

Ok, so I've done everything I wanted in this PR. Now I'll look into test failures and that should be it.
@arssher you can do another review iteration if you want; I'm not going to change anything except bugfixes.

I also think we'd likely benefit from generalizing the task itself, so there is common code for cancelling and waiting for termination.

Sure, one step at a time and we'll get there.

petuhovskiy force-pushed the sk-refactor-backup-launcher branch 2 times, most recently from f0a5beb to 78459ae, May 20, 2024 12:58
petuhovskiy marked this pull request as ready for review May 20, 2024 12:58
petuhovskiy requested a review from a team as a code owner May 20, 2024 12:58
petuhovskiy requested review from jcsp and arssher May 20, 2024 12:58
@arssher (Contributor) left a comment

LGTM. Pushed a minor fix for logging backup task stop, and added logging context for broker pull (I wanted to learn who calls write_shared_state, and it was missing there).

Two review threads on safekeeper/src/timeline_manager.rs (one outdated) were resolved.
@arssher (Contributor) commented May 21, 2024

Also, a high-level comment for timeline_manager.rs would be nice (a partial copy of the PR description would work).

@jcsp (Contributor) commented May 21, 2024

Structurally this looks like a good direction. A couple of general, non-PR-blocking thoughts as we continue to evolve this code:

  • Various functions could be pub(crate) rather than bare pub to make later refactoring easier (the linter can't detect unused pub items).
  • In places where we'd like other code to block waiting for background task shutdown, the Gate helper could be a good fit; see the sketch after this list. It has the nice behavior of printing information if it blocks unexpectedly long. I spent 5 minutes playing with it here, and it seems to drop in pretty easily: https://github.com/neondatabase/neon/commits/jcsp/sk-concurrency-primitives/
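
A generic sketch of the gate pattern described above, built on a tokio semaphore; MiniGate is a hypothetical stand-in, not neon's actual Gate API:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

const GATE_PERMITS: u32 = 1024; // comfortably above the number of tasks

struct MiniGate {
    sem: Arc<Semaphore>,
}

impl MiniGate {
    fn new() -> Self {
        Self { sem: Arc::new(Semaphore::new(GATE_PERMITS as usize)) }
    }

    /// A background task holds the returned permit for the duration of
    /// its work; dropping it "leaves" the gate.
    async fn enter(&self) -> OwnedSemaphorePermit {
        self.sem.clone().acquire_owned().await.expect("gate closed")
    }

    /// Shutdown path: acquiring every permit only succeeds once all
    /// outstanding guards are dropped; complain if that takes too long.
    async fn close(&self) {
        loop {
            let all = self.sem.acquire_many(GATE_PERMITS);
            match tokio::time::timeout(Duration::from_secs(10), all).await {
                Ok(permits) => {
                    permits.expect("gate closed").forget(); // keep gate shut
                    return;
                }
                Err(_) => eprintln!("gate close is taking unexpectedly long"),
            }
        }
    }
}
```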

@petuhovskiy (Member, Author) commented

Various functions could be pub(crate) rather than bare pub to make later refactoring easier (the linter can't detect unused pub items).

Yes, this should be done later. It also makes sense to better (re)organize the safekeeper code into directories.

the Gate helper could be a good fit.

Agreed, but it needs more refactoring to make all tasks use the gate. Created issue #7831.

safekeeper: use CancellationToken instead of watch channel

This change looks good; I'd like to commit it right after this PR.

petuhovskiy merged commit bd5cb9e into main May 22, 2024 (54 of 55 checks passed)
petuhovskiy deleted the sk-refactor-backup-launcher branch May 22, 2024 08:34
petuhovskiy added a commit that referenced this pull request May 23, 2024
I looked at the metrics from #7768 on staging, and it seems that the manager does too many iterations. This is probably caused by the background job `remove_wal.rs`, which iterates over all timelines and tries to remove WAL and persist the control file. This causes shared-state updates and wakes up the manager. The fix is to skip notifying about updates when nothing was actually updated.
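
One way to implement such a fix, assuming the shared state is published through a tokio watch channel (a sketch of the technique, not necessarily the actual change):

```rust
use tokio::sync::watch;

// Only wake the manager when the published state actually changed.
fn publish_state(tx: &watch::Sender<u64>, new_backup_lsn: u64) {
    tx.send_if_modified(|current| {
        if *current == new_backup_lsn {
            false // unchanged: receivers (the manager) are not notified
        } else {
            *current = new_backup_lsn;
            true // changed: notify the manager
        }
    });
}
```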
Linked issue closed by this PR: Simplify backup task spawning in safekeepers (#7751)
4 participants