Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify repository before cluster update #108531

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

mhl-b
Copy link
Contributor

@mhl-b mhl-b commented May 10, 2024

This PR addresses issue when we can create "invalid" repository and persist it
in cluster state. Even thou we return error to caller, repository stays there.

Add repository verification only on master node before cluster update.
Refactored registerRepository code with SubscribableListener to display order
of listener calls.

fixes #107840

@mhl-b mhl-b added >enhancement >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team v8.15.0 labels May 10, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@mhl-b
Copy link
Contributor Author

mhl-b commented May 11, 2024

Some integ tests in s3 and gcp are failing on new verification call. I assume they worked before because we updated cluster first and repository was there, not verified.

I'm looking how to fix this.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall direction looks good but this PR is combining too many independent changes which makes it harder to review and also means that if it happens to introduce a bug then a git bisect won't pin down the problem very precisely.

Could we break out the test suite improvements and merge that change first, and separate out the cosmetic variable renaming in MasterService too? That way we'll be left with a PR that just does the behaviour change.

@ywangd
Copy link
Member

ywangd commented May 13, 2024

+1 to David's suggestion. I assume this PR is meant to fix #107840? If so, can we make it clear in the description, i.e. replaces related with something like fixes so that we get GitHub to associate them?

@mhl-b
Copy link
Contributor Author

mhl-b commented May 13, 2024

@DaveCTurner @ywangd

Could we break out the test suite improvements and merge that change first

I will create separate PR for the test suite improvements, will keep this one open for behaviour change.

elasticsearchmachine pushed a commit that referenced this pull request May 14, 2024
…108589)

I observed that `testRegisterRepositorySuccessAfterCreationFailed` test
never invokes assertion blocks, because listener is not invoked.

There are 2 problems:

1. Test setup used mocks. Mocks interrupt listener chain propagation, so registerRepository never returned Response or Failure.
2. We silently ignore assertions in listener because it is not invoked. Test pass successfully.

PutRepositories method relies on cluster state update. I replace mocked
ClusterService and ThreadPool with test implementation of these. Also
add blocking call on listener to ensure we get result.

Address
[comment](#108531 (review))
to break down larger PR into smaller pieces in #108531
@mhl-b mhl-b changed the title Validate repository before cluster update Unregister repository when PutRepository validation fails May 15, 2024
@mhl-b mhl-b changed the title Unregister repository when PutRepository validation fails Unregister repository when PutRepository verification fails May 16, 2024
@mhl-b
Copy link
Contributor Author

mhl-b commented May 16, 2024

@DaveCTurner @idegtiarenko
After try and err with enhancing validation I decided to change approach. I updated PR description and code. WDYT.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this works unfortunately, at least not reliably enough. The unregisterRepository step may fail if, for instance, there's a master failover in the meantime, or something starts a snapshot operation in between the register and unregister steps. We need to do at least some kind of verification before adding the repo metadata to the cluster state.

@ywangd
Copy link
Member

ywangd commented May 16, 2024

In addition to @DaveCTurner's comment, I think this approach also means we will delete an existing repository if the update request has some invalid settings in it. Feels like a pretty big risk to do so.

@DaveCTurner
Copy link
Contributor

oof yeah that too!

@mhl-b
Copy link
Contributor Author

mhl-b commented May 16, 2024

Alright, back to verification :)

@mhl-b mhl-b changed the title Unregister repository when PutRepository verification fails Verify repository before cluster update May 16, 2024
@mhl-b mhl-b requested a review from DaveCTurner May 17, 2024 02:54
@mhl-b
Copy link
Contributor Author

mhl-b commented May 17, 2024

@DaveCTurner @ywangd
updated PR, using pre-publish verification now, found and fixed complaining tests

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment. Thanks!

Comment on lines 172 to 173
if (request.verify()) {
validatePutRepositoryRequest(request, validationStep);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this done together with validateRepositoryCanBeCreated? I think it feels nature that pre-cluster-state-update validation/verification is done in one place. Also seems to be wasteful to create the repository twice.

While I appreciate that SubscribableListener generally makes code easier to read, can we keep it separate from this PR? If we move this step into validateRepositoryCanBeCreated, I think we don't need to touch the code here at all in this PR which would make it much easier to review. We can always have a follow-up which will be pure-refactoring for introucing SubscribableListener.

Copy link
Contributor Author

@mhl-b mhl-b May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validateRepositoryCanBeCreated method is used in other places, I would rather remove it from here. But it breaks unrelated tests. I can try to chase these down in this PR, or following PR.

Another problem is behaviour change, validateRepositoryCanBeCreated runs despite "verify" flag, but repository verification behind "verify". So putting everything behind verify or including new verification logic regardless "verify" breaks another set of tests including bwc.

About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.

I think it is better to keep functional change and style change separate unless it is very trivial. If somehow we need to revert the functional change, it is also much cleaner if they are separate. This likely won't happen for this PR. But it feels like a overall good principle.

validateRepositoryCanBeCreated method is used in other places

In that case, we don't have to literally move the new code into it. It can sit besides it as a new method. My main point is to not touch the below steps of listeners in this PR.

Let's hear what @DaveCTurner has to say about this before changing anything. Thanks!

Copy link
Member

@ywangd ywangd May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, the SubscribableListener change does not have to come after this PR. It would be totally reasonable to have a separate PR just for that and merge it before proceeding here. My main point is to have them separate, order is not important.

Copy link
Contributor Author

@mhl-b mhl-b May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right about splitting it apart, I talked with David before about it, but got tunnel-visioned chasing down test failures and forgot about it. I rethink what I said before, and I believe all of what you mentioned above can be addressed.

I created PR with non functional change - #108788

I have some thoughts how to organize validateRepositoryCanBeCreated better, will update this PR after refactoring merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. That was my main gripe :) Thanks for splitting it out. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated PR with refactoring included

elasticsearchmachine pushed a commit that referenced this pull request May 17, 2024
This PR is a syntactic change for `registerRepository` in
`RepositoriesService`. I use `SubscribableListener` to display order of
events and reduce boilerplate code around failures delegation
`listener.delegateFailureAndWrap`.

It's a part of larger change for verification logic, which should take
advantage of this "sequential" version of code. #108531
@mhl-b mhl-b requested review from a team as code owners May 21, 2024 22:41
@mhl-b mhl-b requested review from ywangd and removed request for a team May 22, 2024 00:13
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to go except a couple of superficial points, see below.

Comment on lines +359 to +372
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
try {
final var token = repository.startVerification();
if (token != null) {
repository.verify(token, clusterService.localNode());
repository.endVerification(token);
}
resultListener.onResponse(null);
} catch (Exception e) {
resultListener.onFailure(e);
} finally {
closeRepository(repository);
}
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's generally best to avoid submitting a bare Runnable to threadpool executors, because they don't handle rejection very well. In this particular case it'd be ok since (a) SNAPSHOT doesn't reject anything (today) and (b) the whole thing is in a try/catch block anyway, but still these things are easy to miss in future changes in this area. AbstractRunnable is always a better choice:

Suggested change
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> {
try {
final var token = repository.startVerification();
if (token != null) {
repository.verify(token, clusterService.localNode());
repository.endVerification(token);
}
resultListener.onResponse(null);
} catch (Exception e) {
resultListener.onFailure(e);
} finally {
closeRepository(repository);
}
});
threadPool.executor(ThreadPool.Names.SNAPSHOT)
.execute(ActionRunnable.run(ActionListener.runBefore(resultListener, () -> closeRepository(repository)), () -> {
final var token = repository.startVerification();
if (token != null) {
repository.verify(token, clusterService.localNode());
repository.endVerification(token);
}
}));

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
}
}

public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
return safeAwait(SubscribableListener.newForked(exceptionListener -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary {...}, can just be an expression lambda

Suggested change
return safeAwait(SubscribableListener.newForked(exceptionListener -> {
return safeAwait(SubscribableListener.newForked(exceptionListener ->

@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) {
}
}

public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't use the generic type param T anywhere we can just use a wildcard instead:

Suggested change
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) {
public static Exception safeAwaitFailure(SubscribableListener<?> listener) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement Team:Distributed Meta label for distributed team >test Issues or PRs that are addressing/adding tests v8.15.0
Projects
None yet
5 participants