bug: problematic Kafka sources prevent standalone instance from starting on Cloud #16693

Open
BugenZhao opened this issue May 10, 2024 · 4 comments

@BugenZhao (Member) commented May 10, 2024

Describe the bug

When a standalone instance on Cloud (with only 1 CPU core) restarts while the catalog contains many Kafka sources whose brokers are down, it fails to enter the recovery phase and therefore cannot function or serve any queries at all.

Error message/log

Various errors can prevent the meta service from starting up; almost all of them are some sort of "timeout", including:

  • Unable to start leader services: BackupStorage error: s3 error: dispatch failure: other: identity resolver timed out after 5s
  • lease ... keep alive timeout (lose leader)
  • recover mview progress should not fail: Failed to acquire connection from pool: Connection pool timed out

To Reproduce

  1. export TOKIO_WORKER_THREADS=1 to simulate the 1-CPU-core resource limit on Cloud.
  2. Start risingwave and a healthy Kafka cluster.
  3. Create 10 Kafka sources.
  4. Kill risingwave and the Kafka cluster.
  5. Run nc -k -l <broker-port> to simulate a Kafka broker that accepts connections but never responds (see the sketch after this list).
  6. Start risingwave again.
  7. Observe a bunch of error logs from rdkafka.
  8. Find that the instance cannot serve requests and panics after a few seconds.
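
For step 5, here is a minimal Rust equivalent of nc -k -l, purely as an illustration: a listener that accepts TCP connections on the broker port but never reads or writes, so the client hangs until its own timeout fires. The port 9092 is an assumed placeholder for <broker-port>.

// Hypothetical stand-in for `nc -k -l <broker-port>`: accept connections
// but never respond.
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:9092")?; // assumed broker port
    // Hold every accepted connection open without reading or writing,
    // so the Kafka client blocks waiting for a response.
    let mut held = Vec::new();
    for stream in listener.incoming() {
        held.push(stream?);
    }
    Ok(())
}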

Expected behavior

It should at least start the meta service on port 5690 and then the frontend service, allowing users to run DROP SOURCE to drop the problematic sources.

How did you deploy RisingWave?

  • Single-node or standalone mode, with only 1 tokio worker thread.
  • Or the free tier on RisingWave Cloud.

The version of RisingWave

No response

Additional context

Setting TOKIO_WORKER_THREADS to a larger number (like 4) works around the problem.

By attaching a debugger, I believe it's caused by the synchronous interfaces in rust-rdkafka (issue: fede1024/rust-rdkafka#358) being called in ConnectorSourceWorker::run:

// In ConnectorSourceWorker::run: list_splits eventually reaches the
// synchronous fetch_metadata call shown below.
let splits = self.enumerator.list_splits().await.map_err(|e| {
    source_is_up(0);
    self.fail_cnt += 1;
    e
})?;

Then, in fetch_topic_partition:
async fn fetch_topic_partition(&self) -> ConnectorResult<Vec<i32>> {
    // for now, we only support one topic
    let metadata = self
        .client
        .fetch_metadata(Some(self.topic.as_str()), self.sync_call_timeout)
        .await?;

Note that fetch_metadata is actually a synchronous interface; it's only marked async to be compatible with madsim's mocked interfaces. The timeout is also implemented synchronously. When something is wrong with the connection, the call blocks the thread.

Since there's only 1 tokio worker thread, a blocked thread means the whole RW service is blocked. As a result, all kinds of weird timeout errors can happen.
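
One possible mitigation, shown here only as a minimal sketch and not RisingWave's actual code, is to run the blocking rdkafka call on tokio's blocking thread pool via tokio::task::spawn_blocking, so even a single-worker runtime stays responsive. The Arc<BaseConsumer> argument, the anyhow::Result return type, and the partition-collecting body are assumptions for illustration.

use std::sync::Arc;
use std::time::Duration;

use rdkafka::consumer::{BaseConsumer, Consumer};

// Sketch only: move the synchronous fetch_metadata onto the blocking pool
// so it cannot stall the (single) tokio worker thread.
async fn fetch_topic_partition(
    consumer: Arc<BaseConsumer>,
    topic: String,
    timeout: Duration,
) -> anyhow::Result<Vec<i32>> {
    let metadata = tokio::task::spawn_blocking(move || {
        // May block for up to `timeout` when the broker is unreachable,
        // but only on a blocking-pool thread.
        consumer.fetch_metadata(Some(&topic), timeout)
    })
    .await??;

    Ok(metadata
        .topics()
        .iter()
        .flat_map(|t| t.partitions().iter().map(|p| p.id()))
        .collect())
}

This is essentially what the spawn-blocking suggestion in the comments below proposes.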

@BugenZhao added the type/bug (Something isn't working) label on May 10, 2024
@github-actions bot added this to the release-1.10 milestone on May 10, 2024
@tabVersion (Contributor) commented:

@wangrunji0408 shall we also make fetch_metadata spawn blocking?

@tabVersion (Contributor) commented:

madsim-rs/madsim#209 can help with the issue.

@xxchan (Member) commented May 15, 2024:

Let's retest this

@BugenZhao (Member, Author) commented:

Setting TOKIO_WORKER_THREADS to a larger number (like 4) works around the problem.

Also, I believe that using multiple worker threads for deployments with only a single CPU core, especially in standalone or single-node mode, would generally be better, as it reduces the chance of encountering such problems.
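
As a rough sketch (assuming a binary that builds its own tokio runtime rather than relying on the TOKIO_WORKER_THREADS environment variable, and not RisingWave's actual startup code), forcing several worker threads would look like this; with more than one worker, a single blocked thread no longer stalls every other task in the service.

// Sketch only, not RisingWave's actual startup code.
fn main() -> std::io::Result<()> {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4) // what TOKIO_WORKER_THREADS=4 requests via the environment
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // ... start the meta / frontend / compute services here ...
    });
    Ok(())
}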
