
seems like quinn 0.11 not working well under heavy load #1867

Closed · szguoxz opened this issue May 17, 2024 · 12 comments

szguoxz commented May 17, 2024

I can't be sure, but my debugging tells me it can only be a quinn problem. :-)
I'm not sure if I hit the bug you fixed in 0.11.1. Since it doesn't seem to be released on crates.io, I'm not sure how to use 0.11.1.

Anyway, the latest release seems less stable than 0.10, but I could be wrong! Under heavy load the data seems to get stuck and can't be sent on the stream.

Ralith (Collaborator) commented May 17, 2024

quinn-proto 0.11.1 was released on crates.io 9 days ago. Are you using it?

This isn't an actionable report. What is the specific behavior? Do you have a reproducible test case?

szguoxz (Author) commented May 17, 2024

Oh, I went to crates.io and saw quinn is 0.11; I assumed quinn-proto was the same version.
Yes, I am using the latest quinn-proto version, 0.11.1.
My connection gets stuck from time to time, and I can't figure out how to reproduce it yet. It happens within days, sometimes within minutes if I'm lucky.

I'm still trying to find a way to prove it's the stream, but no luck yet; maybe it's a problem on my end.

Ralith (Collaborator) commented May 18, 2024

What exactly does "got stuck" mean? Is the sender unable to write data to a stream? Is the receiver unable to read data that was successfully written? Are other functions of the connection degraded in any way?

There have been some reports of stream flow control issues in #1818; I wonder if that might be related. If this is a flow control issue, then you should see all previously written data successfully received, but an inability to write new data. You can track this by logging the total number of bytes written to/read from the stream in question.
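A minimal sketch of that kind of byte accounting (the helper names and counters are hypothetical, not part of quinn's API; quinn 0.11's write_all/read signatures are assumed):

```rust
// Hypothetical accounting around one quinn stream. Compare the two totals across
// the peers when the stream appears stuck: if everything written has been read
// but new writes block, that points toward flow control.
use std::sync::atomic::{AtomicU64, Ordering};

static BYTES_WRITTEN: AtomicU64 = AtomicU64::new(0);
static BYTES_READ: AtomicU64 = AtomicU64::new(0);

async fn logged_write(send: &mut quinn::SendStream, buf: &[u8]) -> Result<(), quinn::WriteError> {
    send.write_all(buf).await?;
    let total = BYTES_WRITTEN.fetch_add(buf.len() as u64, Ordering::Relaxed) + buf.len() as u64;
    tracing::debug!(total, "stream bytes written");
    Ok(())
}

async fn logged_read(recv: &mut quinn::RecvStream, buf: &mut [u8]) -> Result<usize, quinn::ReadError> {
    // read() returns Ok(None) once the stream is finished; count that as 0 bytes.
    let n = recv.read(buf).await?.unwrap_or(0);
    let total = BYTES_READ.fetch_add(n as u64, Ordering::Relaxed) + n as u64;
    tracing::debug!(total, "stream bytes read");
    Ok(n)
}
```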

> It happens within days, sometimes within minutes if I'm lucky.

Can you run your workload many times concurrently to deliberately trigger the behavior more frequently?

Ralith (Collaborator) commented May 18, 2024

Some interesting internal Quinn state you could try to capture when your application stops making progress:
Send side:

quinn_proto::connection::streams::StreamsState::{max_data, data_sent, unacked_data}
quinn_proto::connection::streams::Send::{max_data, pending.offset()}

Receive side:

quinn_proto::connection::streams::StreamsState::{local_max_data, sent_max_data}
quinn_proto::connection::streams::Recv::{end, sent_stream_max_data, assembler.bytes_read()}

szguoxz (Author) commented May 18, 2024

Yes, it seems like a flow-control problem. It's working fine, and then suddenly it can't write new data; write_all gets stuck.
Well, that's just my guess; I'm still trying to log things to back it up.

szguoxz (Author) commented May 18, 2024

It seems this information isn't publicly available?

> Some interesting internal Quinn state you could try to capture when your application stops making progress:
> Send side:
>
> quinn_proto::connection::streams::StreamsState::{max_data, data_sent, unacked_data}
> quinn_proto::connection::streams::Send::{max_data, pending.offset()}
>
> Receive side:
>
> quinn_proto::connection::streams::StreamsState::{local_max_data, sent_max_data}
> quinn_proto::connection::streams::Recv::{end, sent_stream_max_data, assembler.bytes_read()}

szguoxz (Author) commented May 18, 2024

I did a test. I am building a VPN, sending packets through QUIC.
Using a bidirectional stream with length-delimited framing is much more stable than using unidirectional streams with one stream per frame.
I believe the default TransportConfig is to blame. For example, I need to adjust max_concurrent_uni_streams; 100 is way too low. But even when I change it to 1000, it's still not stable.
The bidirectional stream is much more stable.

It still hangs from time to time, and I still can't figure out why. But I'm pretty sure it's because the data somehow can't be sent. Not only can it not be sent, it also blocks the flow, i.e. write_all().await gets stuck.

Very tough to reproduce; I will continue to watch.
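A rough sketch of the bidirectional-stream-plus-framing approach described above (assuming quinn 0.11 and tokio-util's LengthDelimitedCodec; next_packet_from_tun is a hypothetical packet source, not the reporter's code):

```rust
// One long-lived bidirectional stream carrying length-delimited frames.
// Requires tokio-util with the "codec" feature and the futures crate.
use futures::SinkExt;
use tokio_util::codec::{FramedWrite, LengthDelimitedCodec};

async fn run_tunnel(conn: quinn::Connection) -> anyhow::Result<()> {
    let (send, _recv) = conn.open_bi().await?;
    // quinn::SendStream implements tokio's AsyncWrite, so it can back a FramedWrite.
    let mut framed = FramedWrite::new(send, LengthDelimitedCodec::new());
    loop {
        let packet: Vec<u8> = next_packet_from_tun().await?; // hypothetical packet source
        framed.send(packet.into()).await?; // prefixes each frame with its length
    }
}
```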

Ralith (Collaborator) commented May 18, 2024

> It seems this information isn't publicly available?

Yes, they are internal Quinn state. You can use a modified version of Quinn to insert whatever logging or getters you like.
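As a sketch, one way to run such a modified copy is Cargo's [patch] mechanism (the path is illustrative):

```toml
# Application's Cargo.toml: use a local checkout of quinn-proto with extra logging added.
[patch.crates-io]
quinn-proto = { path = "../quinn/quinn-proto" }
```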

Ralith (Collaborator) commented May 18, 2024

> Using a bidirectional stream with length-delimited framing is much more stable than using unidirectional streams with one stream per frame.

If using short-lived streams fails much more often, can you build a test case using that pattern? If you're observing the same behavior when using short-lived streams, it is much less likely to be a flow control issue.

> I believe the default TransportConfig is to blame. For example, I need to adjust max_concurrent_uni_streams; 100 is way too low. But even when I change it to 1000, it's still not stable.

That parameter governs concurrency. It will not cause your application to hang unless your application is incorrect. In most cases, you should be able to set it to 1 and have no adverse effects beyond degraded throughput.
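To illustrate "unless your application is incorrect": a sketch under quinn 0.11's API, where awaiting open_uni() turns the concurrency limit into backpressure rather than a hang (the helper name and error handling are illustrative):

```rust
// One frame per unidirectional stream while respecting the peer's stream limit.
// open_uni() resolves only once a new stream may be opened, so an exhausted
// max_concurrent_uni_streams limit delays this call instead of breaking anything.
async fn send_frame(conn: &quinn::Connection, frame: &[u8]) -> anyhow::Result<()> {
    let mut stream = conn.open_uni().await?; // waits while the concurrency limit is exhausted
    stream.write_all(frame).await?;
    stream.finish()?; // mark the stream complete so the receiver's stream count is released
    Ok(())
}
```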

szguoxz (Author) commented May 18, 2024

Is there a way to require the stream to send data and receive an ACK within a certain time frame? If it doesn't get the ACK back in time, the stream would invalidate the connection.

I think what I'm looking for is an "ACK timeout" setting on the transport config. Is that possible?


Ralith (Collaborator) commented May 18, 2024

The health of a connection is independent of the state of an individual stream. If a connection is healthy, then so are its streams. If a peer stops responding, the connection will time out according to the idle timeout.
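For reference, the closest existing knobs are the idle timeout and keep-alive interval on TransportConfig; a sketch assuming quinn 0.11, with illustrative values:

```rust
use std::{sync::Arc, time::Duration};

// Illustrative values: time out a connection whose peer stops responding for 10 s,
// and send keep-alives every 3 s so an idle-but-healthy connection is not dropped.
fn transport_config() -> anyhow::Result<Arc<quinn::TransportConfig>> {
    let mut cfg = quinn::TransportConfig::default();
    cfg.max_idle_timeout(Some(Duration::from_secs(10).try_into()?));
    cfg.keep_alive_interval(Some(Duration::from_secs(3)));
    Ok(Arc::new(cfg))
}
```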

szguoxz closed this as completed May 21, 2024
djc closed this as not planned May 21, 2024
Ralith (Collaborator) commented May 21, 2024

Did you root-cause your issue? Is there something we could document better to avoid similar issues in the future?
