New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify repository before cluster update #108531
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/es-distributed (Team:Distributed) |
Some integ tests in s3 and gcp are failing on new verification call. I assume they worked before because we updated cluster first and repository was there, not verified. I'm looking how to fix this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall direction looks good but this PR is combining too many independent changes which makes it harder to review and also means that if it happens to introduce a bug then a git bisect
won't pin down the problem very precisely.
Could we break out the test suite improvements and merge that change first, and separate out the cosmetic variable renaming in MasterService
too? That way we'll be left with a PR that just does the behaviour change.
+1 to David's suggestion. I assume this PR is meant to fix #107840? If so, can we make it clear in the description, i.e. replaces |
server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/repositories/RepositoriesServiceTests.java
Outdated
Show resolved
Hide resolved
I will create separate PR for the test suite improvements, will keep this one open for behaviour change. |
…108589) I observed that `testRegisterRepositorySuccessAfterCreationFailed` test never invokes assertion blocks, because listener is not invoked. There are 2 problems: 1. Test setup used mocks. Mocks interrupt listener chain propagation, so registerRepository never returned Response or Failure. 2. We silently ignore assertions in listener because it is not invoked. Test pass successfully. PutRepositories method relies on cluster state update. I replace mocked ClusterService and ThreadPool with test implementation of these. Also add blocking call on listener to ensure we get result. Address [comment](#108531 (review)) to break down larger PR into smaller pieces in #108531
@DaveCTurner @idegtiarenko |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this works unfortunately, at least not reliably enough. The unregisterRepository
step may fail if, for instance, there's a master failover in the meantime, or something starts a snapshot operation in between the register and unregister steps. We need to do at least some kind of verification before adding the repo metadata to the cluster state.
In addition to @DaveCTurner's comment, I think this approach also means we will delete an existing repository if the update request has some invalid settings in it. Feels like a pretty big risk to do so. |
oof yeah that too! |
Alright, back to verification :) |
@DaveCTurner @ywangd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment. Thanks!
if (request.verify()) { | ||
validatePutRepositoryRequest(request, validationStep); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have this done together with validateRepositoryCanBeCreated
? I think it feels nature that pre-cluster-state-update validation/verification is done in one place. Also seems to be wasteful to create the repository twice.
While I appreciate that SubscribableListener
generally makes code easier to read, can we keep it separate from this PR? If we move this step into validateRepositoryCanBeCreated
, I think we don't need to touch the code here at all in this PR which would make it much easier to review. We can always have a follow-up which will be pure-refactoring for introucing SubscribableListener
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validateRepositoryCanBeCreated
method is used in other places, I would rather remove it from here. But it breaks unrelated tests. I can try to chase these down in this PR, or following PR.
Another problem is behaviour change, validateRepositoryCanBeCreated runs despite "verify" flag, but repository verification behind "verify". So putting everything behind verify or including new verification logic regardless "verify" breaks another set of tests including bwc.
About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About SubscribableListener, I found without little refactor it looks worse. May be diff will be a bit prettier, but final code not. I can do "ugly" change and then refactor in another PR if its a preferable way.
I think it is better to keep functional change and style change separate unless it is very trivial. If somehow we need to revert the functional change, it is also much cleaner if they are separate. This likely won't happen for this PR. But it feels like a overall good principle.
validateRepositoryCanBeCreated method is used in other places
In that case, we don't have to literally move the new code into it. It can sit besides it as a new method. My main point is to not touch the below steps of listeners in this PR.
Let's hear what @DaveCTurner has to say about this before changing anything. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, the SubscribableListener
change does not have to come after this PR. It would be totally reasonable to have a separate PR just for that and merge it before proceeding here. My main point is to have them separate, order is not important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are absolutely right about splitting it apart, I talked with David before about it, but got tunnel-visioned chasing down test failures and forgot about it. I rethink what I said before, and I believe all of what you mentioned above can be addressed.
I created PR with non functional change - #108788
I have some thoughts how to organize validateRepositoryCanBeCreated
better, will update this PR after refactoring merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. That was my main gripe :) Thanks for splitting it out. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated PR with refactoring included
This PR is a syntactic change for `registerRepository` in `RepositoriesService`. I use `SubscribableListener` to display order of events and reduce boilerplate code around failures delegation `listener.delegateFailureAndWrap`. It's a part of larger change for verification logic, which should take advantage of this "sequential" version of code. #108531
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to go except a couple of superficial points, see below.
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> { | ||
try { | ||
final var token = repository.startVerification(); | ||
if (token != null) { | ||
repository.verify(token, clusterService.localNode()); | ||
repository.endVerification(token); | ||
} | ||
resultListener.onResponse(null); | ||
} catch (Exception e) { | ||
resultListener.onFailure(e); | ||
} finally { | ||
closeRepository(repository); | ||
} | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's generally best to avoid submitting a bare Runnable
to threadpool executors, because they don't handle rejection very well. In this particular case it'd be ok since (a) SNAPSHOT
doesn't reject anything (today) and (b) the whole thing is in a try/catch block anyway, but still these things are easy to miss in future changes in this area. AbstractRunnable
is always a better choice:
threadPool.executor(ThreadPool.Names.SNAPSHOT).execute(() -> { | |
try { | |
final var token = repository.startVerification(); | |
if (token != null) { | |
repository.verify(token, clusterService.localNode()); | |
repository.endVerification(token); | |
} | |
resultListener.onResponse(null); | |
} catch (Exception e) { | |
resultListener.onFailure(e); | |
} finally { | |
closeRepository(repository); | |
} | |
}); | |
threadPool.executor(ThreadPool.Names.SNAPSHOT) | |
.execute(ActionRunnable.run(ActionListener.runBefore(resultListener, () -> closeRepository(repository)), () -> { | |
final var token = repository.startVerification(); | |
if (token != null) { | |
repository.verify(token, clusterService.localNode()); | |
repository.endVerification(token); | |
} | |
})); |
@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) { | |||
} | |||
} | |||
|
|||
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) { | |||
return safeAwait(SubscribableListener.newForked(exceptionListener -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: unnecessary {...}
, can just be an expression lambda
return safeAwait(SubscribableListener.newForked(exceptionListener -> { | |
return safeAwait(SubscribableListener.newForked(exceptionListener -> |
@@ -2178,6 +2179,12 @@ public static <T> T safeGet(Future<T> future) { | |||
} | |||
} | |||
|
|||
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't use the generic type param T
anywhere we can just use a wildcard instead:
public static <T> Exception safeAwaitFailure(SubscribableListener<T> listener) { | |
public static Exception safeAwaitFailure(SubscribableListener<?> listener) { |
This PR addresses issue when we can create "invalid" repository and persist it
in cluster state. Even thou we return error to caller, repository stays there.
Add repository verification only on master node before cluster update.
Refactored registerRepository code with SubscribableListener to display order
of listener calls.
fixes #107840