Hit unreachable code #488

Open
howardjohn opened this issue Apr 3, 2023 · 12 comments
@howardjohn
Member

$ cargo bench connections/hbone -- --profile-time 10
    Finished bench [optimized] target(s) in 0.18s
     Running benches/throughput.rs (out/rust/release/deps/throughput-9d94640613568b5f)
thread 'ztunnel-proxy-2' panicked at 'internal error: entered unreachable code: SecretManager worker died: requests channel is closed', /usr/local/google/home/howardjohn/go/src/istio.io/ztunnel/src/identity/manager.rs:478:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Benchmarking connections/hbone: Profiling for 10.000 s
error: bench failed, to rerun pass `--bench throughput`

Caused by:
  process didn't exit successfully: `/usr/local/google/home/howardjohn/go/src/istio.io/ztunnel/out/rust/release/deps/throughput-9d94640613568b5f connections/hbone --profile-time 10 --bench` (signal: 11, SIGSEGV: invalid memory reference)

from b32cc7a

cc @qfel

@qfel qfel self-assigned this Apr 3, 2023
@qfel
Contributor

qfel commented Apr 4, 2023

A few observations:

  • The message above also claims SIGSEGV.
  • Removing the offending unreachable! still causes a SIGSEGV.
  • It's not clear to me whether the SIGSEGV actually happens or the benchmark runner is misbehaving.
  • I can't reproduce it when removing initialization for cases we don't run (i.e. reducing
    targets = latency, throughput, connections, metrics
    to just "connections").
  • Looking at the SecretManager code, I can't figure out how this can possibly happen (save for bugs in the compiler and/or unsafe code).

It seems that when running with targets = throughput, connections and filtering the invocation to just connections/hbone, we still get an extra fetch_certificate_pri request (for the throughput part, I imagine).

Can somebody explain how those tests result in calls to fetch_certificate[_pri] during initialization?

As a side note, https://stackoverflow.com/questions/74243078/long-build-times-for-criterion-cargo-bench-but-not-cargo-build-release-with-la seems to be happening and is really annoying.

@howardjohn
Member Author

SIGSEGV is #489. Probably related in some way. Something seems really weird here; hopefully it's a benchmark runner oddity and not a real issue.

@qfel
Contributor

qfel commented Apr 9, 2023

Unfortunately, reproduced in integration tests without the benchmark:

#[test]
pub fn throughput() {
    let (env, rt) = initialize_environment(Mode::Read);
    let size = 10;
    rt.block_on(async {
        tcp::run_client(&mut env.lock().await.direct, size, Mode::Write).await
    }).unwrap();
}

Will look further.

@qfel
Contributor

qfel commented Apr 9, 2023

Sleeping 1 second after block_on fixes the issue. I think I know what's happening (but it's a theory, I don't know the test app code well enough):

  • initialize_environment creates a multi-threaded tokio runtime
  • that means that after block_on returns, other tasks continue execution in worker threads (that includes SecretManager's Worker::run and something that calls fetch_certificate)
  • returning from the function drops the runtime, which causes tasks to die / cease executing
  • the task calling fetch_certificate_pri is running on a different thread than the one executing Worker::run, so it has a chance to observe the Worker::run one being killed - and panics

So arguably everything works as intended, and the real problem is that the test doesn't properly wait for all work to complete (if nothing else, that's good for coverage). Given tokio provides no way to wait for all tasks to complete (only to yield), that pretty much means we should explicitly track all async work we start.

So TL;DR, the options we have are:

  • Return a regular error instead of panicking
    • That may still cause some concerning error logs during shutdown (since at the SecretManager level we don't know we're shutting down, we can't distinguish the situation from a serious problem)
  • Implement a more controlled shutdown, where all users of SecretManager are terminated before the runtime itself is terminated

Note that depending on how things are implemented, the same problem may be affecting the main binary.
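For illustration, a minimal sketch of what "explicitly track all async work" could look like with tokio::task::JoinSet; do_work and the runtime setup are placeholders, not ztunnel's actual test code:

use tokio::task::JoinSet;

// Placeholder for whatever the tests actually spawn (e.g. tasks that end up
// calling fetch_certificate_pri).
async fn do_work(i: usize) {
    let _ = i;
}

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        // Spawn every task through the JoinSet so we keep a handle to each one.
        let mut tasks = JoinSet::new();
        for i in 0..4 {
            tasks.spawn(do_work(i));
        }
        // Drain the set: block_on only returns once every tracked task has
        // finished, so nothing is still running when the runtime is dropped.
        while let Some(res) = tasks.join_next().await {
            res.unwrap();
        }
    });
    // Dropping `rt` here is now safe: there are no leftover tasks to be
    // killed mid-flight.
}

Anything spawned outside a tracked set can outlive block_on and only dies when the runtime is dropped, which is exactly the window where the panic above fires.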

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

I think new_with_client could be problematic. Inside, a new tokio task is spawned, but the function drops the join handle, returning only the SecretManager itself.

    pub fn new_with_client<C: 'static + CaClientTrait>(client: C) -> Self {
        Self::new_internal(
            Box::new(client),
            SecretManagerConfig {
                time_conv: crate::time::Converter::new(),
                concurrency: 8,
            },
        )
        .0 // <--- take the first part of the returned tuple (Self) while dropping the second (JoinHandle)
    }
    fn new_internal(
        client: Box<dyn CaClientTrait>,
        cfg: SecretManagerConfig,
    ) -> (Self, tokio::task::JoinHandle<()>) {
        // ... logic omitted
    }

Perhaps SecretManager could be adjusted to include the join handle for its worker, or maybe we could add a stop/shutdown closure that holds that join handle and, when called, joins it.
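As a hypothetical sketch of that shape (the fields and shutdown method here are illustrative, not ztunnel's actual types):

use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// Illustrative only: keep the worker's JoinHandle instead of dropping it in
// new_with_client, and expose an explicit, awaitable shutdown.
struct SecretManager {
    requests: mpsc::Sender<()>, // stand-in for the real request type
    worker: JoinHandle<()>,
}

impl SecretManager {
    async fn shutdown(self) {
        // Drop the request sender first so the worker sees the channel close
        // and drains, then wait for it to actually finish.
        drop(self.requests);
        self.worker.await.expect("worker panicked");
    }
}

This only helps if shutdown is called after all other users of the SecretManager are done, which is essentially the "more controlled shutdown" option from the earlier comment.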

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

An alternative could be to hide the implementation detail that there is a spawned thread inside SecretManager by implementing drop for SecretManager, such that when the type is being cleaned up the internal thread is asked to stop and then joined.

@qfel
Contributor

qfel commented Apr 9, 2023

Perhaps SecretManager could be adjusted

But the problem is the opposite - it's not SecretManager going out of scope too early, it's the worker dying while there are SecretManager references in use. Waiting on drop wouldn't help, since by the time it complains about unreachable code, drop hasn't even been called (it panics inside a SecretManager method).

An alternative could be to hide the implementation detail that there is a spawned thread inside SecretManager

I'd argue it's already hidden - if SecretManager is gone, it should exit peacefully without issues or side effects.

by implementing drop for SecretManager

You need async to wait for a JoinHandle. We could probably do something like saving the Tokio runtime handle in new_with_client and using it to block_on the JoinHandle in drop, but that sounds pretty brittle.

FWIW, I wonder if new_with_client should be made async just because it uses tokio::spawn, but I don't recall seeing this type of pattern before.

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

Thanks for the details @qfel.

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

Possible test of your hypothesis:

  • returning from the function drops the runtime, which causes tasks to die / cease executing
  • the task calling fetch_certificate_pri is running on a different thread than the one executing Worker::run, so it has a chance to observe the Worker::run one being killed - and panics

Can you explicitly drop the test env, which contains the SecretManager, prior to dropping the runtime, to ensure it can be cleaned up before the runtime itself is dropped? This would allow dropping the Request send channel, which triggers the Worker to drain/shut down before the runtime drops.

pub fn connections(c: &mut Criterion) {
    let (env, rt) = initialize_environment(Mode::ReadWrite);
    let mut c = c.benchmark_group("connections");
    c.bench_function("direct", |b| {
        // ...testing code omitted
    });
    mem::drop(env);
    mem::drop(rt);
}

@qfel
Contributor

qfel commented Apr 10, 2023

As mentioned in my previous comment, sleeping after the test (but before drop/return) helps, so that's one form of semi-verification. Explicitly dropping env first doesn't help, but neither does it need to - nothing stops something from spawning separate tasks that call fetch_certificate_pri and survive the env drop. At the very least I've found:

If we want to do proper cleanup anyway, that's one way to start addressing the issue here. If we don't - I guess we can silence the error, but I'm not sure it's worth going into the rabbit hole of addressing task leakage just to verify why the test fails.

Basically if we don't care about cleanup, then we shouldn't care about errors that can be caused by the lack of cleanup.

BTW the last link creates a new runtime - it seems like fetch_certificate_pri calls are made from a different runtime than the one running Worker::run.

@ilrudie
Contributor

ilrudie commented Apr 10, 2023

I had hoped that dropping the test env might avoid falling into the default arm of the tokio::select! in fn wait by allowing rx.changed() to error with forgotten.
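As a generic illustration of that mechanism (not ztunnel's actual fn wait), once every watch::Sender is dropped, rx.changed() resolves to an Err that the worker loop can treat as a clean shutdown signal:

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(0u32);

    let worker = tokio::spawn(async move {
        // Loop until the sender side goes away.
        while rx.changed().await.is_ok() {
            println!("value is now {}", *rx.borrow());
        }
        // changed() returned Err: all senders are gone - treat it as shutdown
        // rather than an unreachable/panic path.
        println!("sender dropped, exiting cleanly");
    });

    tx.send(1).unwrap();
    drop(tx); // dropping the sender lets the worker exit on its own
    worker.await.unwrap();
}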

Agreed that if the issue is shown to be purely about cleanup ordering in tests/benches, then errors during cleanup are OK, at least in the short term, and a quicker fix is likely fine. If folks want to eliminate cleanup errors in the medium/long term, an issue can be created to track organizing cleanup, at a lower priority than investigating this panic.

@qfel
Contributor

qfel commented Apr 15, 2023

cleanup ordering in tests/benches

From what I have seen, the main offender is actually app.rs, so you may see it in real-world use too.
