Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds-failover: adding main logic #34215

Merged
merged 12 commits into from
May 29, 2024
Merged

Conversation

adisuissa
Copy link
Contributor

Commit Message: xds-failover: adding main logic
Additional Description:
Adding the main logic for the xDS-failover support.
It currently supports a primary and failover sources. Initially, the manager attempts to connect to the primary twice, and if both fail, alternates between the failover and the primary until one is successfully connected to.
Once a connection is made (a response arrives) from either source, the failover manager will stick to that source.

Risk Level: low - not plumbed into any GrpcMux object yet.
Testing: Added unit test.
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #34215 was opened by adisuissa.

see: more, trace.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@adisuissa adisuissa marked this pull request as ready for review May 17, 2024 03:22
@adisuissa
Copy link
Contributor Author

Harvey PTAL. LMK if you want more context (I've got an internal integration test the exercises this code).
/assign @htuch

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach LGTM modulo a few issues commented on.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@adisuissa
Copy link
Contributor Author

Assigning @soulxu as !Google maintainer reviewer.
/assign @soulxu

@adisuissa
Copy link
Contributor Author

Unassigning @soulxu as he is OOO this week.
@wbpcode can you please take a ~Google pass at this PR?

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
@adisuissa
Copy link
Contributor Author

@nezdolik will you have some bandwidth to review this PR as a !Google reviewer?
Thanks in advance!

ENVOY_LOG_MISC(trace, "Attempting to reconnect to the failover gRPC source, as a connection "
"to it was previously established.");
if (connected_to_ != ConnectedState::Failover) {
ASSERT(connected_to_ == ConnectedState::None);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to explicitly set connecting_to_primary_ = false; here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an assert, as connecting_to_primary_ should always be false after being connected to failover. Added a similar assert for the primary part above.

}

virtual ~GrpcMuxFailover() = default;

// Attempts to establish a new stream to the either the primary or failover source.
void establishNewStream() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method has certain degree of code duplication, wonder if it would be possible to refactor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added functions that abstract the calls. I think it is still more readable/understandable if the "ever-connected" parts are done before the rest of the code.

// determined similar to the primary variants.
// Note that while Envoy can only be connected to a single source (mutually
// exclusive), it can attempt connecting to more than one source at a time.
bool connecting_to_primary_ : 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the benefit of using bit fields here and not simply initilialise like:

bool connecting_to_primary_  {false};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force of habit (to keep bools small, which is good when there are many instances), but I agree that it makes little sense here.
Modified as you suggested.

ConnectedState connected_to_;

// A flag that keeps track of whether Envoy successfully connected to either the
// primary or failover source.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also clarify that successful connection here means "receiving discovery response from xds server"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note.

return;
}
// No prior connection was established, prefer the primary over the failover.
if (connecting_to_failover_ || (connected_to_ == ConnectedState::Failover)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confused how (connected_to_ == ConnectedState::Failover) could evaluate to true here, e.g. the code execution would get here if ever_connected_to_ is None and there is no established stream

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, it cannot. Removed that part.

failover_grpc_stream_->sendMessage(request);
return;
}
// Either connecting/connected to the primary, or no connection was attempted.
primary_grpc_stream_->sendMessage(request);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to check here if we are connected to primary prior to sending? Or would it just produce stream error logs when not connected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep it the same as GrpcStream. We may want to ensure that in the future, but for now I think we can keep parity.

// connection was closed.
ASSERT((parent_.connecting_to_primary_ || parent_.connected_to_ == ConnectedState::Primary) &&
!parent_.connecting_to_failover_ &&
(parent_.connected_to_ != ConnectedState::Failover));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to repeat the parent_.connected_to_ != ConnectedState::Failover assertion here? since it is also being evaluated in the line above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. Removed the prior ASSERT (same for the failover part).

// TODO(adisuissa): At the moment this is a pass-through method. Once the
// implementation matures, this call will be updated.
ASSERT(
(parent_.connecting_to_primary_ || (parent_.connected_to_ == ConnectedState::Primary)) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this bool expression (and maybe similar ones) appear frequently in the code:
(parent_.connecting_to_primary_ || (parent_.connected_to_ == ConnectedState::Primary), it could benefit from extracting it into small utility method like isConnectingToOrConnectedToPrimary()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!
I've refactored these and the similar primary counterparts into a function.

Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Signed-off-by: Adi Suissa-Peleg <adip@google.com>
Copy link
Contributor Author

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great comments @nezdolik!

}

virtual ~GrpcMuxFailover() = default;

// Attempts to establish a new stream to the either the primary or failover source.
void establishNewStream() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added functions that abstract the calls. I think it is still more readable/understandable if the "ever-connected" parts are done before the rest of the code.

ENVOY_LOG_MISC(trace, "Attempting to reconnect to the failover gRPC source, as a connection "
"to it was previously established.");
if (connected_to_ != ConnectedState::Failover) {
ASSERT(connected_to_ == ConnectedState::None);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an assert, as connecting_to_primary_ should always be false after being connected to failover. Added a similar assert for the primary part above.

return;
}
// No prior connection was established, prefer the primary over the failover.
if (connecting_to_failover_ || (connected_to_ == ConnectedState::Failover)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct, it cannot. Removed that part.

failover_grpc_stream_->sendMessage(request);
return;
}
// Either connecting/connected to the primary, or no connection was attempted.
primary_grpc_stream_->sendMessage(request);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep it the same as GrpcStream. We may want to ensure that in the future, but for now I think we can keep parity.

// connection was closed.
ASSERT((parent_.connecting_to_primary_ || parent_.connected_to_ == ConnectedState::Primary) &&
!parent_.connecting_to_failover_ &&
(parent_.connected_to_ != ConnectedState::Failover));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are correct. Removed the prior ASSERT (same for the failover part).

// TODO(adisuissa): At the moment this is a pass-through method. Once the
// implementation matures, this call will be updated.
ASSERT(
(parent_.connecting_to_primary_ || (parent_.connected_to_ == ConnectedState::Primary)) &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!
I've refactored these and the similar primary counterparts into a function.

// determined similar to the primary variants.
// Note that while Envoy can only be connected to a single source (mutually
// exclusive), it can attempt connecting to more than one source at a time.
bool connecting_to_primary_ : 1;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force of habit (to keep bools small, which is good when there are many instances), but I agree that it makes little sense here.
Modified as you suggested.

ConnectedState connected_to_;

// A flag that keeps track of whether Envoy successfully connected to either the
// primary or failover source.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note.

Copy link
Member

@nezdolik nezdolik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, exciting feature! There is also certain degree of code duplication across FailoverGrpcStreamCallbacks and PrimaryGrpcStreamCallbacks, but is also fine to leave it as it is if refactoring is too much effort.

@adisuissa
Copy link
Contributor Author

Thanks, exciting feature! There is also certain degree of code duplication across FailoverGrpcStreamCallbacks and PrimaryGrpcStreamCallbacks, but is also fine to leave it as it is if refactoring is too much effort.

Thanks!
I think that the primary and fallback will be quite different as we get to test this out and add some different behavior.
We can always refactor the code later, making them extend the same class or something.

@adisuissa
Copy link
Contributor Author

cc @htuch for final approval and merge

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@htuch htuch merged commit f7952ef into envoyproxy:main May 29, 2024
52 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants