Verify repository before cluster update #108531

mhl-b · 2024-05-10T23:22:02Z

This PR addresses issue when we can create "invalid" repository and persist it
in cluster state. Even thou we return error to caller, repository stays there.

Add repository verification only on master node before cluster update.
Refactored registerRepository code with SubscribableListener to display order
of listener calls.

fixes #107840

elasticsearchmachine · 2024-05-10T23:22:27Z

Pinging @elastic/es-distributed (Team:Distributed)

mhl-b · 2024-05-11T00:15:52Z

Some integ tests in s3 and gcp are failing on new verification call. I assume they worked before because we updated cluster first and repository was there, not verified.

I'm looking how to fix this.

DaveCTurner

Overall direction looks good but this PR is combining too many independent changes which makes it harder to review and also means that if it happens to introduce a bug then a git bisect won't pin down the problem very precisely.

Could we break out the test suite improvements and merge that change first, and separate out the cosmetic variable renaming in MasterService too? That way we'll be left with a PR that just does the behaviour change.

ywangd · 2024-05-13T05:36:36Z

+1 to David's suggestion. I assume this PR is meant to fix #107840? If so, can we make it clear in the description, i.e. replaces related with something like fixes so that we get GitHub to associate them?

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

mhl-b · 2024-05-13T17:54:55Z

@DaveCTurner @ywangd

Could we break out the test suite improvements and merge that change first

I will create separate PR for the test suite improvements, will keep this one open for behaviour change.

…108589) I observed that `testRegisterRepositorySuccessAfterCreationFailed` test never invokes assertion blocks, because listener is not invoked. There are 2 problems: 1. Test setup used mocks. Mocks interrupt listener chain propagation, so registerRepository never returned Response or Failure. 2. We silently ignore assertions in listener because it is not invoked. Test pass successfully. PutRepositories method relies on cluster state update. I replace mocked ClusterService and ThreadPool with test implementation of these. Also add blocking call on listener to ensure we get result. Address [comment](#108531 (review)) to break down larger PR into smaller pieces in #108531

mhl-b · 2024-05-16T02:05:02Z

@DaveCTurner @idegtiarenko
After try and err with enhancing validation I decided to change approach. I updated PR description and code. WDYT.

DaveCTurner

I don't think this works unfortunately, at least not reliably enough. The unregisterRepository step may fail if, for instance, there's a master failover in the meantime, or something starts a snapshot operation in between the register and unregister steps. We need to do at least some kind of verification before adding the repo metadata to the cluster state.

ywangd · 2024-05-16T08:58:14Z

In addition to @DaveCTurner's comment, I think this approach also means we will delete an existing repository if the update request has some invalid settings in it. Feels like a pretty big risk to do so.

DaveCTurner · 2024-05-16T08:59:33Z

oof yeah that too!

mhl-b · 2024-05-16T14:23:17Z

Alright, back to verification :)

mhl-b · 2024-05-17T04:01:03Z

@DaveCTurner @ywangd
updated PR, using pre-publish verification now, found and fixed complaining tests

ywangd

I left a comment. Thanks!

ywangd · 2024-05-17T04:29:38Z

server/src/main/java/org/elasticsearch/repositories/RepositoriesService.java

+                if (request.verify()) {
+                    validatePutRepositoryRequest(request, validationStep);


Can we have this done together with validateRepositoryCanBeCreated? I think it feels nature that pre-cluster-state-update validation/verification is done in one place. Also seems to be wasteful to create the repository twice.

While I appreciate that SubscribableListener generally makes code easier to read, can we keep it separate from this PR? If we move this step into validateRepositoryCanBeCreated, I think we don't need to touch the code here at all in this PR which would make it much easier to review. We can always have a follow-up which will be pure-refactoring for introucing SubscribableListener.

validateRepositoryCanBeCreated method is used in other places, I would rather remove it from here. But it breaks unrelated tests. I can try to chase these down in this PR, or following PR.

Another problem is behaviour change, validateRepositoryCanBeCreated runs despite "verify" flag, but repository verification behind "verify". So putting everything behind verify or including new verification logic regardless "verify" breaks another set of tests including bwc.

About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.

About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.

I think it is better to keep functional change and style change separate unless it is very trivial. If somehow we need to revert the functional change, it is also much cleaner if they are separate. This likely won't happen for this PR. But it feels like a overall good principle.

validateRepositoryCanBeCreated method is used in other places

In that case, we don't have to literally move the new code into it. It can sit besides it as a new method. My main point is to not touch the below steps of listeners in this PR.

Let's hear what @DaveCTurner has to say about this before changing anything. Thanks!

Btw, the SubscribableListener change does not have to come after this PR. It would be totally reasonable to have a separate PR just for that and merge it before proceeding here. My main point is to have them separate, order is not important.

You are absolutely right about splitting it apart, I talked with David before about it, but got tunnel-visioned chasing down test failures and forgot about it. I rethink what I said before, and I believe all of what you mentioned above can be addressed.

I created PR with non functional change - #108788

I have some thoughts how to organize validateRepositoryCanBeCreated better, will update this PR after refactoring merged.

Awesome. That was my main gripe :) Thanks for splitting it out. 👍

updated PR with refactoring included

This PR is a syntactic change for `registerRepository` in `RepositoriesService`. I use `SubscribableListener` to display order of events and reduce boilerplate code around failures delegation `listener.delegateFailureAndWrap`. It's a part of larger change for verification logic, which should take advantage of this "sequential" version of code. #108531

DaveCTurner

Looks good to go except a couple of superficial points, see below.

DaveCTurner · 2024-05-23T06:54:03Z

server/src/main/java/org/elasticsearch/repositories/RepositoriesService.java

+                threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
+                    try {
+                        final var token = repository.startVerification();
+                        if (token != null) {
+                            repository.verify(token, clusterService.localNode());
+                            repository.endVerification(token);
+                        }
+                        resultListener.onResponse(null);
+                    } catch (Exception e) {
+                        resultListener.onFailure(e);
+                    } finally {
+                        closeRepository(repository);
+                    }
+                });


It's generally best to avoid submitting a bare Runnable to threadpool executors, because they don't handle rejection very well. In this particular case it'd be ok since (a) SNAPSHOT doesn't reject anything (today) and (b) the whole thing is in a try/catch block anyway, but still these things are easy to miss in future changes in this area. AbstractRunnable is always a better choice:

Suggested change

threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {

try {

final var token = repository.startVerification();

if (token != null) {

repository.verify(token, clusterService.localNode());

repository.endVerification(token);

}

resultListener.onResponse(null);

} catch (Exception e) {

resultListener.onFailure(e);

} finally {

closeRepository(repository);

}

});

threadPool.executor(ThreadPool.Names.SNAPSHOT)

.execute(ActionRunnable.run(ActionListener.runBefore(resultListener, () -> closeRepository(repository)), () -> {

final var token = repository.startVerification();

if (token != null) {

repository.verify(token, clusterService.localNode());

repository.endVerification(token);

}

}));

DaveCTurner · 2024-05-23T06:56:40Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
        }
    }

+    public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
+        return safeAwait(SubscribableListener.newForked(exceptionListener -> {


Nit: unnecessary {...}, can just be an expression lambda

Suggested change

return safeAwait(SubscribableListener.newForked(exceptionListener -> {

return safeAwait(SubscribableListener.newForked(exceptionListener ->

DaveCTurner · 2024-05-23T06:56:42Z

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
        }
    }

+    public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {


Since we don't use the generic type param T anywhere we can just use a wildcard instead:

Suggested change

public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {

public static Exception safeAwaitFailure(SubscribableListener<?> listener) {

mhl-b added >enhancement >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team v8.15.0 labels May 10, 2024

mhl-b requested review from ywangd, DiannaHohensee and DaveCTurner May 10, 2024 23:22

DaveCTurner reviewed May 12, 2024

View reviewed changes

idegtiarenko reviewed May 13, 2024

View reviewed changes

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java Outdated Show resolved Hide resolved

idegtiarenko reviewed May 13, 2024

View reviewed changes

server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java Outdated Show resolved Hide resolved

DaveCTurner mentioned this pull request May 13, 2024

Ensure listener called in testRegisterRepositorySuccessAfterCreationFailed #108477

Closed

DaveCTurner reviewed May 13, 2024

View reviewed changes

test/framework/src/main/java/org/elasticsearch/test/ESTestCase.java Show resolved Hide resolved

DaveCTurner mentioned this pull request May 13, 2024

Extract SAFE_AWAIT_TIMEOUT constant #108554

Merged

mhl-b mentioned this pull request May 13, 2024

Replace mocks with test implementation in RepositoriesServiceTests #108589

Merged

unregister repository if validation fails

0474f8c

mhl-b force-pushed the repo-validation branch from dc4a5a1 to 0474f8c Compare May 15, 2024 02:50

mhl-b changed the title ~~Validate repository before cluster update~~ Unregister repository when PutRepository validation fails May 15, 2024

mhl-b requested review from idegtiarenko and DaveCTurner May 15, 2024 03:03

mhl-b added 5 commits May 14, 2024 20:10

naming

2d63dc6

naming

bfa15df

fix logger

66eaa22

Merge remote-tracking branch 'upstream/main' into repo-validation

58d6852

revert executor type

4892603

mhl-b added 2 commits May 15, 2024 15:43

add more logging, UT and IT changes

8c4e77e

remove IT test

badc639

mhl-b changed the title ~~Unregister repository when PutRepository validation fails~~ Unregister repository when PutRepository verification fails May 16, 2024

DaveCTurner requested changes May 16, 2024

View reviewed changes

add verification before cluster update

98bffa9

mhl-b changed the title ~~Unregister repository when PutRepository verification fails~~ Verify repository before cluster update May 16, 2024

mhl-b added 3 commits May 16, 2024 14:21

Merge remote-tracking branch 'upstream/main' into repo-validation

cd50ec2

fix bwc

ed55c01

fix test deadlock

d95ecd2

mhl-b requested a review from DaveCTurner May 17, 2024 02:54

bring comments back

e7b5bb6

ywangd reviewed May 17, 2024

View reviewed changes

mhl-b mentioned this pull request May 17, 2024

Refactor registerRepository method #108788

Merged

update

fcd79a6

mhl-b requested review from a team as code owners May 21, 2024 22:41

update

418e167

mhl-b requested review from ywangd and removed request for a team May 22, 2024 00:13

DaveCTurner reviewed May 23, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Verify repository before cluster update #108531

Verify repository before cluster update #108531

mhl-b commented May 10, 2024 •

edited

elasticsearchmachine commented May 10, 2024

mhl-b commented May 11, 2024 •

edited

DaveCTurner left a comment

ywangd commented May 13, 2024

mhl-b commented May 13, 2024

mhl-b commented May 16, 2024

DaveCTurner left a comment

ywangd commented May 16, 2024

DaveCTurner commented May 16, 2024

mhl-b commented May 16, 2024

mhl-b commented May 17, 2024

ywangd left a comment

ywangd May 17, 2024

mhl-b May 17, 2024 •

edited

ywangd May 17, 2024

ywangd May 17, 2024 •

edited

mhl-b May 17, 2024 •

edited

ywangd May 18, 2024

mhl-b May 22, 2024

DaveCTurner left a comment

DaveCTurner May 23, 2024

DaveCTurner May 23, 2024

DaveCTurner May 23, 2024

		if (request.verify()) {
		validatePutRepositoryRequest(request, validationStep);

	return safeAwait(SubscribableListener.newForked(exceptionListener -> {
	return safeAwait(SubscribableListener.newForked(exceptionListener ->

	public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
	public static Exception safeAwaitFailure(SubscribableListener<?> listener) {

Verify repository before cluster update #108531

Are you sure you want to change the base?

Verify repository before cluster update #108531

Conversation

mhl-b commented May 10, 2024 • edited

elasticsearchmachine commented May 10, 2024

mhl-b commented May 11, 2024 • edited

DaveCTurner left a comment

Choose a reason for hiding this comment

ywangd commented May 13, 2024

mhl-b commented May 13, 2024

mhl-b commented May 16, 2024

DaveCTurner left a comment

Choose a reason for hiding this comment

ywangd commented May 16, 2024

DaveCTurner commented May 16, 2024

mhl-b commented May 16, 2024

mhl-b commented May 17, 2024

ywangd left a comment

Choose a reason for hiding this comment

ywangd May 17, 2024

Choose a reason for hiding this comment

mhl-b May 17, 2024 • edited

Choose a reason for hiding this comment

ywangd May 17, 2024

Choose a reason for hiding this comment

ywangd May 17, 2024 • edited

Choose a reason for hiding this comment

mhl-b May 17, 2024 • edited

Choose a reason for hiding this comment

ywangd May 18, 2024

Choose a reason for hiding this comment

mhl-b May 22, 2024

Choose a reason for hiding this comment

DaveCTurner left a comment

Choose a reason for hiding this comment

DaveCTurner May 23, 2024

Choose a reason for hiding this comment

DaveCTurner May 23, 2024

Choose a reason for hiding this comment

DaveCTurner May 23, 2024

Choose a reason for hiding this comment

mhl-b commented May 10, 2024 •

edited

mhl-b commented May 11, 2024 •

edited

mhl-b May 17, 2024 •

edited

ywangd May 17, 2024 •

edited

mhl-b May 17, 2024 •

edited