
Dying Pods break NetworkChaos (and perhaps others that inject into the pid) #1446

Open
torblerone opened this issue Jan 22, 2021 · 8 comments

@torblerone

Bug Report

What version of Kubernetes are you using?

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.9", GitCommit:"4fb7ed12476d57b8437ada90b4f93b17ffaeed99", GitTreeState:"clean", BuildDate:"2020-07-15T16:10:45Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

What version of Chaos Mesh are you using?

Controller manager Version: version.Info{GitVersion:"v1.1.0", GitCommit:"eeeba39ad758f2463e8ca83a704a06c8957ebc01", BuildDate:"2021-01-08T13:15:03Z", GoVersion:"go1.14.6", Compiler:"gc", Platform:"linux/amd64"}

What did you do?
I've created three types of Chaos experiments: Pod Kill, Pod Failure and NetworkChaos.

What did you expect to see?
I expected the daemons to somehow talk to each other and keep each other informed about their targets, or at least that the controller would keep some kind of overview of which targets are (still) accessible.

What did you see instead?
When NetworkChaos tries to run against a specified target (a random set of pods from a namespace-deployment combination) while PodKill or PodFailure runs at roughly the same time (also against a random set of pods from the same namespace-deployment combination) and kills the pods that NetworkChaos was trying to target, the NetworkChaos experiment breaks so badly that I have to re-apply the experiment YAML. Otherwise, the experiment keeps trying to run against the same pods from the previous run (which have already died from the PodKill).
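
For reference, the overlap looks roughly like this (a sketch only; the names, namespace, labels, and schedule are placeholders rather than our real manifests). Both experiments select pods of the same Deployment, so PodKill can remove exactly the pods NetworkChaos is injecting into:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-1000ms
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - my-namespace            # placeholder namespace
    labelSelectors:
      app: nginx                # placeholder label of the target Deployment
  delay:
    latency: "1000ms"
  duration: "60s"
  scheduler:
    cron: "@every 5m"           # placeholder schedule
---
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - my-namespace            # same namespace ...
    labelSelectors:
      app: nginx                # ... and same labels as above, so the targets collide
  scheduler:
    cron: "@every 5m"           # placeholder schedule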

The dashboard shows the following error on the NetworkChaos experiment:

An error occurred: 1 error occurred:
	* admission webhook "vpodnetworkchaos.kb.io" denied the request: rpc error: code = Unknown desc = error code: exit status 101, msg: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Sys(ENOENT)', src/main.rs:75:61
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1076
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1537
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:217
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:520
  11: rust_begin_unwind
             at src/libstd/panicking.rs:431
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  13: core::option::expect_none_failed
             at src/libcore/option.rs:1269
  14: nsexec::main
  15: std::rt::lang_start::{{closure}}
  16: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  17: std::panicking::try::do_call
             at src/libstd/panicking.rs:342
  18: std::panicking::try
             at src/libstd/panicking.rs:319
  19: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  20: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  21: main
  22: __libc_start_main
  23: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Output of chaosctl

chaos failed with: 1 error occurred:
	* admission webhook "vpodnetworkchaos.kb.io" denied the request: rpc error: code = Unknown desc = error code: exit status 101, msg: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Sys(ENOENT)', src/main.rs:75:61
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1: backtrace::backtrace::trace_unsynchronized
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:78
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1076
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1537
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:198
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:217
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:520
  11: rust_begin_unwind
             at src/libstd/panicking.rs:431
  12: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  13: core::option::expect_none_failed
             at src/libcore/option.rs:1269
  14: nsexec::main
  15: std::rt::lang_start::{{closure}}
  16: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:52
  17: std::panicking::try::do_call
             at src/libstd/panicking.rs:342
  18: std::panicking::try
             at src/libstd/panicking.rs:319
  19: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  20: std::rt::lang_start_internal
             at src/libstd/rt.rs:51
  21: main
  22: __libc_start_main
  23: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.




E0122 15:31:10.992920    9387 portforward.go:400] an error occurred forwarding 43995 -> 31767: error forwarding port 31767 to pod d0ea0f48221a34a7a1c72b30126363808be3747e2c61bf09e4be40b50ee41ae6, uid : exit status 1: 2021/01/22 14:31:10 socat[85038] E connect(5, AF=2 127.0.0.1:31767, 16): Connection refused
[Chaos]: network-delay-1000ms-ANONYMIZED-nginx

[Pod]: nginx-6846975d77-lmq2d

Error: for nginx-6846975d77-lmq2d: container get pid failed: rpc error: code = Unavailable desc = connection closed
Usage:
  chaosctl debug networkchaos (CHAOSNAME) [-n NAMESPACE] [flags]

Flags:
  -h, --help   help for networkchaos

Global Flags:
  -n, --namespace string   namespace to find chaos (default "default")

2021-01-22 15:31:10.993500 I | for nginx-6846975d77-lmq2d: container get pid failed: rpc error: code = Unavailable desc = connection closed

@STRRL STRRL self-assigned this Jan 24, 2021
@STRRL STRRL added chaos/network type/bug Something isn't working chaos/pod labels Jan 24, 2021
@STRRL
Member

STRRL commented Jan 25, 2021

Hi @torblerone, currently this is the expected behavior for the combination of PodChaos and NetworkChaos. If you want to apply different kinds of chaos to the same blast radius, we do not guarantee that they are compatible right now.

As you mentioned, behaviors like "inject Chaos A into some pods, then inject Chaos B into the remaining pods" are a kind of specification of chaos experiment targets. Unfortunately, we do not support such a spec yet.

You could apply different labels to split the pods and use labelSelector to separate the targets of each chaos experiment.
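
For example (just a sketch; the chaos-group label and its values are hypothetical, any disjoint labels will do): label one subset of pods for network chaos and another subset for pod chaos, then point each experiment at its own label:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-1000ms
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - my-namespace
    labelSelectors:
      app: nginx
      chaos-group: network      # hypothetical label: only these pods receive network chaos
  delay:
    latency: "1000ms"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - my-namespace
    labelSelectors:
      app: nginx
      chaos-group: pod-kill     # hypothetical label: only these pods get killed
  scheduler:
    cron: "@every 5m"           # placeholder schedule

With disjoint label values, the two experiments can never select the same pod, so NetworkChaos never ends up pointing at a pod that PodKill has already removed.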

I am going to close this issue now; if you have other ideas, feel free to reopen it. 😁

@STRRL STRRL closed this as completed Jan 25, 2021
@STRRL STRRL removed the type/bug Something isn't working label Jan 25, 2021
@YangKeao
Member

I prefer to regard it as a bug until we fix it 💢. @STRRL don't forget that the Workflow feature will make it easy to spawn two chaos experiments at the same time.

@YangKeao YangKeao reopened this Jan 25, 2021
@YangKeao YangKeao added the type/bug Something isn't working label Jan 25, 2021
@STRRL
Member

STRRL commented Jan 25, 2021

I haven't figured out what the reasonable behavior should be for a combination of more than one type of chaos. 🤔

Any suggestion? @YangKeao

@torblerone
Author

I didn't do a deep dive into the chaos-controller, but doesn't it hold some kind of information about which pods are affected by a certain type of chaos? If so, you could prevent any new chaos experiment from running on those pods by implementing a mechanism that "blocks" already affected pods.


Otherwise, the chaos-daemon itself could have a mechanism that prevents the same pod from being affected by two types of chaos?

@torblerone
Author

I discovered that when we deploy a new version of a service (which happens very often, especially in the DEV environment) while NetworkChaos is running concurrently, the NetworkChaos also breaks.

You can probably reproduce this by deploying a new image (or something similar) to a Deployment while NetworkChaos is running against the current pods (which still have the old version and will be replaced, e.g. during a rolling upgrade).

@YangKeao
Member

I discovered that when we deploy a new version of a service (which happens very often, especially in the DEV environment) while NetworkChaos is running concurrently, the NetworkChaos also breaks.

You can probably reproduce this by deploying a new image (or something similar) to a Deployment while NetworkChaos is running against the current pods (which still have the old version and will be replaced, e.g. during a rolling upgrade).

Thanks for your report.

Yes, this is a known issue for us. Chaos Mesh does not try to inject chaos into newly created pods. What's more, when a selected pod dies, it does not look for a new pod to fulfill its selector requirements (such as the max-percent requirement). It's definitely a bug in Chaos Mesh and will be fixed at some point.
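
To illustrate the max-percent case (a hypothetical spec; the names and numbers are placeholders): with a mode like random-max-percent, Chaos Mesh picks up to that percentage of the matching pods once, at injection time, and if one of the selected pods dies afterwards, no replacement pod is picked to keep the percentage satisfied.

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-max-percent
spec:
  action: delay
  mode: random-max-percent
  value: "50"                   # at most 50% of the matching pods, chosen once at injection time
  selector:
    namespaces:
      - my-namespace
    labelSelectors:
      app: nginx
  delay:
    latency: "1000ms"
  duration: "60s"
  scheduler:
    cron: "@every 5m"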

We have tried several times (and several pull requests) to fix this problem, but those PRs were all too complicated or too expensive, and in the end none of them got merged. But once the Workflow feature lands (in several weeks) and we cut down the load of the twophase controller (the cron part), it will be much more realistic to make another attempt 😸.

@Pantalones411

I am experiencing the same issue and am wondering whether a workaround was ever found for this.

@g1eny0ung g1eny0ung changed the title Dying Pods break NetworkChaos Dying Pods break NetworkChaos (and perhaps others that inject into the pid) Jun 7, 2023
@g1eny0ung g1eny0ung pinned this issue Jun 7, 2023
@DutchEllie

It's pretty incredible how impossibly stuck it gets. I have an HTTPChaos resource in my cluster that failed 4 days ago because some pod restarted due to the error. It has been completely stuck for 4 days: it cannot be deleted, started, or paused, and it is impossible to work with. The entire namespace is stuck deleting, and when I force the deletion through a workaround from Red Hat's website, the resource comes back as soon as the namespace is recreated.

It creates error messages every couple of seconds and has amassed a good few thousand of them by now.
