v1.9.0-rc chaos-mesh case failed #16706

Open
cyliu0 opened this issue May 11, 2024 · 3 comments
Labels
found-by-chaos-mesh type/bug Something isn't working

@cyliu0
Contributor

cyliu0 commented May 11, 2024

Describe the bug

https://buildkite.com/risingwave-test/chaos-mesh/builds/826#018f659e-814b-43fd-9534-32cadf65a1a4
https://buildkite.com/risingwave-test/chaos-mesh/builds/825#018f659e-81e8-42f4-8a0d-238aa91fc23b

The test seems to be stuck at creating a table.

image

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

v1.9.0-rc

Additional context

No response

@cyliu0 cyliu0 added type/bug Something isn't working block-release-v1.9 labels May 11, 2024
@github-actions github-actions bot added this to the release-1.10 milestone May 11, 2024
@lmatz
Contributor

lmatz commented May 13, 2024

In both "create table stuck" cases, the chaos we injected includes:

[ name: networkchaos-frontend-meta-longcmkf-20240511-030816, mode: one, action: partition, direction: from, duration: 600s ]

There is a network partition between the frontend node and the meta node for 10 minutes.

We expect CREATE TABLE to either succeed or return an error message after 10 minutes.

@fuyufjh
Contributor

fuyufjh commented May 13, 2024

I checked the await-tree dump file and didn't find anything wrong. Streaming was running normally. My guess is that nothing notifies the frontend's create table request to stop waiting and return an error.

There is a network partition between the frontend node and the meta node for 10 minutes.

I think the call stack will be like

frontend request (protocol layer)
  - create table handler
     - RPC call to Meta <-- waiting here

Perhaps the RPC call to Meta didn't fail as expected. Let's dive into it.
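
To make the expectation concrete, here is a minimal sketch of a handler that bounds its wait on Meta and surfaces an error instead of hanging. It is not RisingWave code: `call_meta_create_table`, `handle_create_table`, `MetaUnavailable`, and the `deadline_s` parameter are all hypothetical names, and the partition is simulated by a coroutine that never completes.

```python
import asyncio


class MetaUnavailable(Exception):
    """Hypothetical error returned to the client when Meta does not reply."""


async def call_meta_create_table(partitioned: bool) -> str:
    # Stand-in for the RPC call to Meta (hypothetical, not the real client).
    if partitioned:
        # Simulates a partition: the reply never arrives and nothing fails.
        await asyncio.Event().wait()
    return "CREATE_TABLE"


async def handle_create_table(partitioned: bool, deadline_s: float) -> str:
    # Bound the wait so the handler returns an error instead of hanging.
    try:
        return await asyncio.wait_for(
            call_meta_create_table(partitioned), timeout=deadline_s
        )
    except asyncio.TimeoutError:
        raise MetaUnavailable(f"no reply from Meta within {deadline_s}s") from None


def run(partitioned: bool, deadline_s: float = 0.05) -> str:
    # Protocol-layer stand-in: turn the handler's outcome into a client reply.
    try:
        return asyncio.run(handle_create_table(partitioned, deadline_s))
    except MetaUnavailable as err:
        return f"ERROR: {err}"
```

Calling `run(False)` returns the normal result, while `run(True)` returns an error string once the deadline passes. Whether the real fix belongs in a per-request deadline, connection keepalives, or somewhere else is still open at this point in the thread.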

@cyliu0
Contributor Author

cyliu0 commented May 14, 2024

We reran those two for v1.9.0-rc-1.

One hit this again: https://buildkite.com/risingwave-test/chaos-mesh/builds/829#018f7135-e456-440d-8212-0d6b161e9dee

The other one passed: https://buildkite.com/risingwave-test/chaos-mesh/builds/830

So the reproduction rate is quite high, but not 100%.
