
NBFTReliability #401
Draft · wants to merge 18 commits into master
Conversation

OlivierHecart (Contributor)

No description provided.

@p-avital (Contributor)

For now, my main blocker is not code (I haven't read much of it yet), but lack of documentation:

  • What does NBFT mean?
  • What are its contracts? (What problems are solved, under what conditions)
  • How should it be used?
    • How many caches?
    • How should the caches be configured?

The filenames are also a bit awkward; why not group all NBFT things into a directory rather than using prefixed files (at least for the ones in src)?

For the builder, I would ask why they don't just extend the standard builders, but I guess it's mostly because you might need access to data that the builders keep private. If possible, it would be nice to add a small trait that lets one do sub_builder.reliable() and obtain the NBFTReliableSubscriberBuilder. I only see this as possible if the reliable builders can have a SubscriberBuilder field, which might not be possible.

This last comment makes me think: maybe it would be useful to have a (forever unstable) way to break a builder into its component parts, so that zenoh-ext would less often need to re-implement the builders fully just to add one option.
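
For illustration, a minimal self-contained sketch of the sub_builder.reliable() idea; every type here is a simplified stand-in rather than the real zenoh / zenoh-ext builder, and the trait name is hypothetical:

```rust
// Hypothetical sketch: an extension trait turning a standard subscriber
// builder into the NBFT reliable one. All types are simplified stand-ins.
#![allow(dead_code)]

struct SubscriberBuilder {
    key_expr: String,
}

struct NBFTReliableSubscriberBuilder {
    // Only possible if the reliable builder can keep the standard builder
    // as a field, as discussed above.
    inner: SubscriberBuilder,
    history: bool,
}

trait ReliableBuilderExt {
    fn reliable(self) -> NBFTReliableSubscriberBuilder;
}

impl ReliableBuilderExt for SubscriberBuilder {
    fn reliable(self) -> NBFTReliableSubscriberBuilder {
        NBFTReliableSubscriberBuilder { inner: self, history: false }
    }
}

fn main() {
    let sub_builder = SubscriberBuilder { key_expr: "demo/example/**".to_string() };
    // The ergonomics being asked for:
    let _reliable = sub_builder.reliable();
}
```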

@OlivierHecart (Contributor, Author)

> For now, my main blocker is not code (I haven't read much of it yet), but lack of documentation.

Most of the answers (if they exist) are there: https://github.com/eclipse-zenoh/roadmap/blob/main/rfcs/ALL/Non%20Blocking%20Fault%20Tolerant%20Reliability.md
Should we point there from the rustdoc?

> The filenames are also a bit awkward, why not group all NBFT things into a directory rather than prefixed files (at least for the ones in src)?

Sure

> For the builder, I would ask why they don't just extend the standard builders, but I guess it's mostly because you might need access to data that the builders keep private.

That's indeed the main reason.

let _cache = session
    .declare_reliability_cache(key_expr)
    .history(history)
    .queryable_prefix(prefix)
Member

What does queryable_prefix do? How does it differ from the key_expr used in declare_reliability_cache?

Contributor (Author)

As described here, data published on key k by a publisher with id <publisher_id> will be made available by the cache for queries on <publisher_id>/k. So by changing this prefix you indicate for which publishers this cache is storing data (see here). Typically the cache attached to an NBFTReliablePublisher will have the publisher id as prefix, to only store data from this publisher, while a cache deployed in the infrastructure will have a * prefix to store data from all publishers.
The code was partially inspired by the existing publication cache and names were taken from there. But I agree better names could be found.
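
To make the two deployments concrete, a hedged sketch using the builder calls shown in this PR's examples; publisher_id is a placeholder, and the res_sync() terminator is borrowed from the other snippets rather than confirmed for this builder:

```rust
// Cache attached to an NBFTReliablePublisher: the prefix is the publisher id,
// so only data from that publisher is stored and served under <publisher_id>/k.
let _local_cache = session
    .declare_reliability_cache(&key_expr)
    .queryable_prefix(publisher_id) // placeholder for this publisher's id
    .history(1024)
    .res_sync()?;

// Cache deployed in the infrastructure: wildcard prefix, stores data
// from all publishers matching the key expression.
let _infra_cache = session
    .declare_reliability_cache(&key_expr)
    .queryable_prefix("*")
    .history(1024)
    .res_sync()?;
```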

println!("Declaring NBFTReliabilityCache on {}", key_expr);
let _cache = session
.declare_reliability_cache(key_expr)
.history(history)
Member

If history is not configured, what's the default? Is it mandatory?

Contributor (Author)

The default history is 1024.

Member

The size of the reliability cache really depends on the use case and should be configured accordingly. Making it mandatory would force the user to think about how big or small it has to be. The same applies to the publisher and subscriber. Should we make it mandatory?

println!("Declaring NBFTReliablePublisher on {}", &key_expr);
let publ = session
.declare_reliable_publisher(&key_expr)
.with_cache(cache)
Member

Here I was expecting with_cache to accept a cache object created by declare_reliability_cache. Instead, it seems to take a bool. This is a bit counter-intuitive for the with_X pattern.

Contributor (Author)

A cache associated with a publisher will be preconfigured for it: it will typically have the publisher id as queryable prefix. So taking a cache declared independently does not seem the best way to me. Still, a better name for the function could be found...
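
For illustration, a hedged sketch of how this reads on the user side; the method names follow the PR snippets, and the commented-out block only restates the description above, it is not the actual implementation:

```rust
// `with_cache` takes a bool and declares/attaches a cache preconfigured
// for this publisher (publisher id as queryable prefix).
let publ = session
    .declare_reliable_publisher(&key_expr)
    .with_cache(true) // a bool, not a cache object
    .history(1024)
    .res_sync()?;

// Conceptually, with_cache(true) behaves as if the publisher declared,
// internally:
//
//   session.declare_reliability_cache(&key_expr)
//          .queryable_prefix(<this publisher's id>)
//          .history(1024)
//
// which is why an independently declared cache is not accepted here.
```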

let publ = session
    .declare_reliable_publisher(&key_expr)
    .with_cache(cache)
    .history(history)
Member

If history is not configured, what's the default? Is it mandatory?

Contributor (Author)

The default history is 1024.

println!("Declaring NBFTReliableSubscriber on {}", key_expr);
let subscriber = session
.declare_reliable_subscriber(key_expr)
.history(history)
Member

If history is not configured, what's the default? Is it mandatory?

Contributor (Author)

The default is false.

While the history function of the NBFTReliabilityCache and NBFTReliablePublisher builders accepts an integer and configures the depth of the reliability cache in number of samples for each key, the history function of the NBFTReliableSubscriber builder is a boolean that defines whether the NBFTReliableSubscriber should query for historical data at startup. This is indeed a bit confusing; one of them should be renamed.
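
To illustrate the confusion, and one possible fix, a hedged sketch; the builder calls follow the PR snippets, and query_history is a purely hypothetical rename:

```rust
// Cache / publisher builders: `history` is an integer, the per-key depth
// of the reliability cache in samples (default 1024).
let _cache = session
    .declare_reliability_cache(&key_expr)
    .history(1024);

// Subscriber builder: `history` is a boolean, whether to query for
// historical data at startup (default false).
let _sub = session
    .declare_reliable_subscriber(&key_expr)
    .history(true);

// Hypothetical disambiguation: keep `.history(n)` for cache depth and
// rename the subscriber option, e.g. `.query_history(true)`.
```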

zenoh-ext/src/nbftreliability_cache.rs (outdated; conversation resolved)
zenoh-ext/src/nbftreliable_subscriber.rs (conversation resolved)
.accept_replies(ReplyKeyExpr::Any)
.target(self.query_target)
.timeout(self.query_timeout)
.res_sync();
Member

This should be res_async since it is executed within the async fn run.
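
A minimal sketch of the suggested change, assuming the SyncResolve/AsyncResolve pattern already used by this code; session.get(&query_selector) is assumed since the snippet starts mid-chain:

```rust
// Inside `async fn run`, resolve the query asynchronously so the executor
// thread is not blocked while the query is issued.
let _ = session
    .get(&query_selector)              // assumed: the snippet starts mid-chain
    .accept_replies(ReplyKeyExpr::Any)
    .target(self.query_target)
    .timeout(self.query_timeout)
    .res_async()                       // was: .res_sync()
    .await;
```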

.callback({
    move |r: Reply| {
        if let Ok(s) = r.sample {
            let (ref mut states, wait) = &mut *zlock!(handler.statesref);
Member

Is there a reason to use a sync lock here instead of an async one? In case a lock needs to be used in both async and sync contexts, I would use an async lock and task::block_on in the sync context. This way we avoid potentially blocking the executor in the async context.
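
A hedged sketch of that suggestion, assuming async-std (used by zenoh here) and a placeholder state type:

```rust
use std::sync::Arc;

use async_std::sync::Mutex;
use async_std::task;

type States = Vec<String>; // placeholder for the real state type

// In async context: lock without blocking the executor.
async fn handle_async(states: Arc<Mutex<States>>) {
    let mut guard = states.lock().await;
    guard.push("from async".to_string());
}

// In the sync callback: block only the calling thread on the same lock.
fn handle_sync_callback(states: Arc<Mutex<States>>) {
    let mut guard = task::block_on(states.lock());
    guard.push("from sync callback".to_string());
}

fn main() {
    let states = Arc::new(Mutex::new(Vec::new()));
    handle_sync_callback(states.clone());
    task::block_on(handle_async(states));
}
```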

move |r: Reply| {
    if let Ok(s) = r.sample {
        let (ref mut states, wait) =
            &mut *zlock!(handler.statesref);
Member

It seems that here the lock is taken and kept until handle_sample has finished. However, handle_sample calls a callback while keeping the lock. Shouldn't the lock be released before calling the callback?
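
A hedged sketch of the pattern being suggested, with placeholder types: decide what to deliver while holding the lock, drop the guard, then invoke the user callback:

```rust
use std::sync::{Arc, Mutex};

type Sample = String; // placeholder for the real sample type

fn handle_sample(
    states: &Arc<Mutex<Vec<Sample>>>,
    sample: Sample,
    callback: &(dyn Fn(Sample) + Send + Sync),
) {
    // 1. Update the state and decide what can be delivered, under the lock.
    let ready: Vec<Sample> = {
        let mut guard = states.lock().unwrap();
        guard.push(sample);
        guard.drain(..).collect()
    }; // 2. The guard is dropped here, before any callback runs.

    // 3. Invoke the user callback without holding the lock.
    for s in ready {
        callback(s);
    }
}

fn main() {
    let states = Arc::new(Mutex::new(Vec::new()));
    handle_sample(&states, "sample".to_string(), &|s| println!("delivered {s}"));
}
```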

@evshary (Contributor)

evshary commented Sep 1, 2023

@OlivierHecart Since it's been a while, is this PR still alive? Being on guard duty this week, I just want to make sure no PR is forgotten. 😃

@fuzzypixelz (Member)

@OlivierHecart Please change your pull request's base branch to main (the new default branch), and rebase your branch against main, as it is missing a status check that is necessary to merge this pull request but is only available on main.

@milyin mentioned this pull request on Jan 25, 2024
@imstevenpmwork (Contributor)

imstevenpmwork commented Mar 14, 2024

@milyin I see this PR was mentioned in #669 which is part of the next release. Would you mind linking this PR to the issue and adding it to the project roadmap? Bonus points if you use the release tag and set a timeline :D

@milyin (Contributor)

milyin commented Mar 15, 2024

> @milyin I see this PR was mentioned in #669 which is part of the next release. Would you mind linking this PR to the issue and adding it to the project roadmap? Bonus points if you use the release tag and set a timeline :D

This implementation is going to be rewritten according to the requirements in #669. This will definitely be started after the release; currently we don't have time for it. Converting the PR to draft.

@milyin marked this pull request as draft on March 15, 2024 10:31