Hit unreachable code #488
A few observations:
It seems that when running with targets = throughput, connections and filtering the invocation to just connections/hbone, we still get an extra fetch_certificate_pri request (for the throughput part, I imagine). Can somebody explain how those tests result in calls to fetch_certificate[_pri] during initialization? As a side note, https://stackoverflow.com/questions/74243078/long-build-times-for-criterion-cargo-bench-but-not-cargo-build-release-with-la seems to be happening and is really annoying.
SIGSEGV is #489, probably related in some way. Something seems really weird here; hopefully it's a benchmark-runner oddity and not a real issue.
Unfortunately, reproduced in integration tests without the benchmark:
Will look further.
Sleeping 1 second after the test makes the problem go away.
So arguably everything works as intended, and the real problem is that the test doesn't properly wait for all work to complete (if nothing else, that's good for coverage). Given that tokio provides no way to wait for all tasks to complete (only to yield), that pretty much means we should explicitly track all async work started. So TL;DR, the options we have are:
Note that depending on how things are implemented, the same problem may be affecting the main binary.
I think new_with_client could be problematic. Inside it a new tokio task is spawned, but this function drops the join handle, returning only the SecretManager:

```rust
pub fn new_with_client<C: 'static + CaClientTrait>(client: C) -> Self {
    Self::new_internal(
        Box::new(client),
        SecretManagerConfig {
            time_conv: crate::time::Converter::new(),
            concurrency: 8,
        },
    )
    .0 // <--- take the first part of the returned tuple (Self) while dropping the second (JoinHandle)
}

fn new_internal(
    client: Box<dyn CaClientTrait>,
    cfg: SecretManagerConfig,
) -> (Self, tokio::task::JoinHandle<()>) {
    // ... logic omitted
}
```

Perhaps SecretManager could be adjusted to include the join handle for its worker, or maybe add a stop/shutdown closure which captures that join handle and joins when called.
An alternative could be to hide the implementation detail that there is a spawned task inside.
But the problem is the opposite - it's not SecretManager going out of scope too early, it's the worker dying while there are still SecretManager references in use. Waiting on drop wouldn't help, since by the time it complains about unreachable code, drop hasn't even been called (it panics inside a SecretManager method).
I'd argue it's already hidden - if the SecretManager is gone, the worker should exit peacefully without issues / side effects.
You need async to wait for a JoinHandle. We could probably do something like saving the Tokio runtime handle in new_with_client and using it to block_on the JoinHandle in drop, but that sounds pretty brittle. FWIW I wonder if new_with_client should be made async just because it uses tokio::spawn, but I don't recall seeing this type of pattern before.
Thanks for the details @qfel.
Possible test of your hypothesis:
Can you explicitly drop the test env, which contains the SecretManager, prior to dropping the runtime, to ensure that it can be cleaned up before the runtime itself is dropped? This would allow dropping the Request send channel, which triggers Worker to drain/shutdown prior to the runtime dropping.

```rust
pub fn connections(c: &mut Criterion) {
    let (env, rt) = initialize_environment(Mode::ReadWrite);
    let mut c = c.benchmark_group("connections");
    c.bench_function("direct", |b| {
        // ... testing code omitted
    });
    mem::drop(env);
    mem::drop(rt);
}
```
As mentioned in my previous comment, sleeping after the test (but before drop/return) helps, so that's one form of semi-verification. Explicitly dropping env first doesn't help, but nor does it need to - nothing stops something from spawning separate tasks that call fetch_certificate_pri and survive the env drop. At the very least I've found:
If we want to do proper cleanup anyway, that's one way to start addressing the issue here. If we don't, I guess we can silence the error, but I'm not sure it's worth going down the rabbit hole of addressing task leakage just to verify why the test fails. Basically, if we don't care about cleanup, then we shouldn't care about errors that can be caused by the lack of cleanup. BTW the last link creates a new runtime - it seems the fetch_certificate_pri calls are made from a different runtime than the one running Worker::run.
I had hoped that dropping the test env might avoid falling into that default arm. Agree that if it can be shown that the issue is purely about cleanup ordering in tests/benches, then errors during cleanup are OK, at least in the short term, and a quicker fix is likely fine. An issue can be created to track organizing cleanup, at a lower priority than investigating this panic, if folks want to eliminate cleanup errors in the medium/long term.
From what I have seen, the main offender is actually app.rs, so you may see this in real-world use too.
from b32cc7a
cc @qfel