Hit unreachable code #488

Open
howardjohn opened this issue Apr 3, 2023 · 12 comments
@howardjohn
Member

$ cargo bench connections/hbone -- --profile-time 10
    Finished bench [optimized] target(s) in 0.18s
     Running benches/throughput.rs (out/rust/release/deps/throughput-9d94640613568b5f)
thread 'ztunnel-proxy-2' panicked at 'internal error: entered unreachable code: SecretManager worker died: requests channel is closed', /usr/local/google/home/howardjohn/go/src/istio.io/ztunnel/src/identity/manager.rs:478:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Benchmarking connections/hbone: Profiling for 10.000 s
error: bench failed, to rerun pass `--bench throughput`

Caused by:
  process didn't exit successfully: `/usr/local/google/home/howardjohn/go/src/istio.io/ztunnel/out/rust/release/deps/throughput-9d94640613568b5f connections/hbone --profile-time 10 --bench` (signal: 11, SIGSEGV: invalid memory reference)

from b32cc7a

cc @qfel

@qfel qfel self-assigned this Apr 3, 2023
@qfel
Contributor

qfel commented Apr 4, 2023

A few observations:

  • The message above also claims SIGSEGV.
  • Removing the offending unreachable! still causes a SIGSEGV.
  • It's not clear to me whether the SIGSEGV actually happens or the benchmark runner is misbehaving.
  • I can't reproduce it when removing initialization for cases we don't run (i.e. reducing
    targets = latency, throughput, connections, metrics
    to just "connections").
  • Looking at the SecretManager code, I can't figure out how this can possibly happen (save for bugs in the compiler and/or unsafe code).

It seems that when running with targets = throughput, connections and filtering the invocation to just connections/hbone, we still get an extra fetch_certificate_pri request (for the throughput part, I imagine).

Can somebody explain how those tests result in calls to fetch_certificate[_pri] during initialization?

As a side note, https://stackoverflow.com/questions/74243078/long-build-times-for-criterion-cargo-bench-but-not-cargo-build-release-with-la seems to be happening and is really annoying.

@howardjohn
Member Author

SIGSEGV is #489. Probably related in some way. Something seems really weird here; hopefully it's a benchmark runner oddity and not a real issue.

@qfel
Contributor

qfel commented Apr 9, 2023

Unfortunately, reproduced in integration tests without the benchmark:

#[test]
pub fn throughput() {
    let (env, rt) = initialize_environment(Mode::Read);
    let size = 10;
    rt.block_on(async {
        tcp::run_client(&mut env.lock().await.direct, size, Mode::Write).await
    }).unwrap();
}

Will look further.

@qfel
Contributor

qfel commented Apr 9, 2023

Sleeping 1 second after block_on fixes the issue. I think I know what's happening (but it's a theory, I don't know the test app code well enough):

  • initialize_environment creates a multi-threaded tokio runtime
  • that means that after block_on returns, other tasks continue execution in worker threads (that includes SecretManager's Worker::run and something that calls fetch_certificate)
  • returning from the function drops the runtime, which causes tasks to die / cease executing
  • the task calling fetch_certificate_pri is running on a different thread than the one executing Worker::run, so it has a chance to observe the Worker::run one being killed - and panics

So arguably everything works as intended, and the real problem is that the test doesn't properly wait for all work to complete (if nothing else, that's good for coverage). Given tokio provides no way to wait for all tasks to complete (only to yield), that pretty much means we should explicitly track all async work we start.

So TL;DR, the options we have are:

  • Return a regular error instead of panicking
    • That may still cause some concerning error logs during shutdown (since at the SecretManager level we don't know we're shutting down, we can't distinguish the situation from a serious problem)
  • Implement a more controlled shutdown, where all users of SecretManager are terminated before the runtime itself is terminated

Note that depending on how things are implemented, the same problem may be affecting the main binary.
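For illustration, a minimal sketch of what "explicitly track all async work" could look like with tokio::task::JoinSet; do_work and the runtime setup are placeholders, not ztunnel's actual test code:

use tokio::task::JoinSet;

// Placeholder for whatever the tests actually spawn (e.g. tasks that end up
// calling fetch_certificate_pri).
async fn do_work(i: usize) {
    let _ = i;
}

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();
    rt.block_on(async {
        // Spawn every task through the JoinSet so we keep a handle to each one.
        let mut tasks = JoinSet::new();
        for i in 0..4 {
            tasks.spawn(do_work(i));
        }
        // Drain the set: block_on only returns once every tracked task has
        // finished, so nothing is still running when the runtime is dropped.
        while let Some(res) = tasks.join_next().await {
            res.unwrap();
        }
    });
    // Dropping `rt` here is now safe: there are no leftover tasks to be
    // killed mid-flight.
}

Anything spawned outside a tracked set can outlive block_on and only dies when the runtime is dropped, which is exactly the window where the panic above fires.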

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

I think new_with_client could be problematic. Inside, a new tokio task is spawned, but the function drops the join handle, returning only the SecretManager itself.

    pub fn new_with_client<C: 'static + CaClientTrait>(client: C) -> Self {
        Self::new_internal(
            Box::new(client),
            SecretManagerConfig {
                time_conv: crate::time::Converter::new(),
                concurrency: 8,
            },
        )
        .0 // <--- take the first part of the returned tuple (Self) while dropping the second (JoinHandle)
    }
    fn new_internal(
        client: Box<dyn CaClientTrait>,
        cfg: SecretManagerConfig,
    ) -> (Self, tokio::task::JoinHandle<()>) {
        // ... logic omitted
    }

Perhaps SecretManager could be adjusted to include the join handle for its worker, or maybe we could add a stop/shutdown closure that holds that join handle and, when called, joins it.
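As a hypothetical sketch of that shape (the fields and shutdown method here are illustrative, not ztunnel's actual types):

use tokio::sync::mpsc;
use tokio::task::JoinHandle;

// Illustrative only: keep the worker's JoinHandle instead of dropping it in
// new_with_client, and expose an explicit, awaitable shutdown.
struct SecretManager {
    requests: mpsc::Sender<()>, // stand-in for the real request type
    worker: JoinHandle<()>,
}

impl SecretManager {
    async fn shutdown(self) {
        // Drop the request sender first so the worker sees the channel close
        // and drains, then wait for it to actually finish.
        drop(self.requests);
        self.worker.await.expect("worker panicked");
    }
}

This only helps if shutdown is called after all other users of the SecretManager are done, which is essentially the "more controlled shutdown" option from the earlier comment.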

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

An alternative could be to hide the implementation detail that there is a spawned thread inside SecretManager by implementing drop for SecretManager, such that when the type is being cleaned up the internal thread is asked to stop and then joined.

@qfel
Contributor

qfel commented Apr 9, 2023

Perhaps SecretManager could be adjusted

But the problem is the opposite - it's not SecretManager going out of scope too early, it's the worker dying while there are SecretManager references in use. Waiting on drop wouldn't help, since by the time it complains about unreachable code, drop hasn't even been called (it panics inside a SecretManager method).

An alternative could be to hide the implementation detail that there is a spawned thread inside SecretManager

I'd argue it's already hidden - if SecretManager is gone, it should exit peacefully without issues or side effects.

by implementing drop for SecretManager

You need async to wait for a JoinHandle. We could probably do something like saving the Tokio runtime handle in new_with_client and using it to block_on the JoinHandle in drop, but that sounds pretty brittle.

FWIW, I wonder if new_with_client should be made async just because it uses tokio::spawn, but I don't recall seeing this type of pattern before.

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

Thanks for the details @qfel.

@ilrudie
Contributor

ilrudie commented Apr 9, 2023

Possible test of your hypothesis:

  • returning from the function drops the runtime, which causes tasks to die / cease executing
  • the task calling fetch_certificate_pri is running on a different thread than the one executing Worker::run, so it has a chance to observe the Worker::run one being killed - and panics

Can you explicitly drop the test env, which contains the SecretManager, prior to dropping the runtime, to ensure it can be cleaned up before the runtime itself is dropped? This would allow dropping the Request send channel, which triggers the Worker to drain/shut down before the runtime drops.

pub fn connections(c: &mut Criterion) {
    let (env, rt) = initialize_environment(Mode::ReadWrite);
    let mut c = c.benchmark_group("connections");
    c.bench_function("direct", |b| {
        // ...testing code omitted
    });
    mem::drop(env);
    mem::drop(rt);
}

@qfel
Contributor

qfel commented Apr 10, 2023

As mentioned in my previous comment, sleeping after the test (but before drop/return) helps, so that's one form of semi-verification. Explicitly dropping env first doesn't help, but neither does it need to - nothing stops something from spawning separate tasks that call fetch_certificate_pri and survive the env drop. At the very least I've found:

If we want to do proper cleanup anyway, that's one way to start addressing the issue here. If we don't - I guess we can silence the error, but I'm not sure it's worth going into the rabbit hole of addressing task leakage just to verify why the test fails.

Basically if we don't care about cleanup, then we shouldn't care about errors that can be caused by the lack of cleanup.

BTW the last link creates a new runtime - it seems like fetch_certificate_pri calls are made from a different runtime than the one running Worker::run.

@ilrudie
Contributor

ilrudie commented Apr 10, 2023

I had hoped that dropping the test env might avoid falling into the default arm of the tokio::select! in fn wait by allowing rx.changed() to error with forgotten.
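As a generic illustration of that mechanism (not ztunnel's actual fn wait), once every watch::Sender is dropped, rx.changed() resolves to an Err that the worker loop can treat as a clean shutdown signal:

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(0u32);

    let worker = tokio::spawn(async move {
        // Loop until the sender side goes away.
        while rx.changed().await.is_ok() {
            println!("value is now {}", *rx.borrow());
        }
        // changed() returned Err: all senders are gone - treat it as shutdown
        // rather than an unreachable/panic path.
        println!("sender dropped, exiting cleanly");
    });

    tx.send(1).unwrap();
    drop(tx); // dropping the sender lets the worker exit on its own
    worker.await.unwrap();
}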

Agreed that if the issue is shown to be purely about cleanup ordering in tests/benches, then errors during cleanup are OK, at least in the short term, and a quicker fix is likely fine. If folks want to eliminate cleanup errors in the medium/long term, an issue can be created to track organizing cleanup, at a lower priority than investigating this panic.

@qfel
Contributor

qfel commented Apr 15, 2023

cleanup ordering in tests/benches

From what I have seen, the main offender is actually app.rs, so you may see it in real-world use too.
