
seems like quinn 0.11 not working well under heavy load #1867

Closed · szguoxz opened this issue May 17, 2024 · 12 comments

szguoxz commented May 17, 2024

I can't be sure, but my debugging tells me it can only be a quinn problem. :-)
I'm not sure if I hit the bug you fixed in 0.11.1. Since it doesn't seem to be released on crates.io, I'm not sure how to use 0.11.1.

Anyway, the latest release seems less stable than 0.10, but I could be wrong! Under heavy load the data seems to get stuck and can't be sent on the stream.

Ralith (Collaborator) commented May 17, 2024

quinn-proto 0.11.1 was released on crates.io 9 days ago. Are you using it?

This isn't an actionable report. What is the specific behavior? Do you have a reproducible test case?

szguoxz (Author) commented May 17, 2024

Oh, I went to crates.io and saw quinn is 0.11; I assumed quinn-proto was the same version.
Yes, I am using the latest quinn-proto version, 0.11.1.
My connection gets stuck from time to time, and I can't figure out how to reproduce it yet. It happens within days, sometimes within minutes if I'm lucky.

I'm still trying to find a way to prove it's the stream, but no luck yet; maybe it's a problem on my end.

Ralith (Collaborator) commented May 18, 2024

What exactly does "got stuck" mean? Is the sender unable to write data to a stream? Is the receiver unable to read data that was successfully written? Are other functions of the connection degraded in any way?

There have been some reports of stream flow control issues in #1818; I wonder if that might be related. If this is a flow control issue, then you should see all previously written data successfully received, but an inability to write new data. You can track this by logging the total number of bytes written to/read from the stream in question.
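A minimal sketch of that kind of byte accounting (the helper names and counters are hypothetical, not part of quinn's API; quinn 0.11's write_all/read signatures are assumed):

```rust
// Hypothetical accounting around one quinn stream. Compare the two totals across
// the peers when the stream appears stuck: if everything written has been read
// but new writes block, that points toward flow control.
use std::sync::atomic::{AtomicU64, Ordering};

static BYTES_WRITTEN: AtomicU64 = AtomicU64::new(0);
static BYTES_READ: AtomicU64 = AtomicU64::new(0);

async fn logged_write(send: &mut quinn::SendStream, buf: &[u8]) -> Result<(), quinn::WriteError> {
    send.write_all(buf).await?;
    let total = BYTES_WRITTEN.fetch_add(buf.len() as u64, Ordering::Relaxed) + buf.len() as u64;
    tracing::debug!(total, "stream bytes written");
    Ok(())
}

async fn logged_read(recv: &mut quinn::RecvStream, buf: &mut [u8]) -> Result<usize, quinn::ReadError> {
    // read() returns Ok(None) once the stream is finished; count that as 0 bytes.
    let n = recv.read(buf).await?.unwrap_or(0);
    let total = BYTES_READ.fetch_add(n as u64, Ordering::Relaxed) + n as u64;
    tracing::debug!(total, "stream bytes read");
    Ok(n)
}
```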

> It happens within days, sometimes within minutes if I'm lucky.

Can you run your workload many times concurrently to deliberately trigger the behavior more frequently?

Ralith (Collaborator) commented May 18, 2024

Some interesting internal Quinn state you could try to capture when your application stops making progress:
Send side:

quinn_proto::connection::streams::StreamsState::{max_data, data_sent, unacked_data}
quinn_proto::connection::streams::Send::{max_data, pending.offset()}

Receive side:

quinn_proto::connection::streams::StreamsState::{local_max_data, sent_max_data}
quinn_proto::connection::streams::Recv::{end, sent_stream_max_data, assembler.bytes_read()}

szguoxz (Author) commented May 18, 2024

Yes, it seems like a flow-control problem. It's working fine, and then suddenly it can't write new data; write_all gets stuck.
Well, that's just my guess; I'm still trying to log things to back it up.

szguoxz (Author) commented May 18, 2024

It seems this information isn't publicly available?

> Some interesting internal Quinn state you could try to capture when your application stops making progress:
> Send side:
>
> quinn_proto::connection::streams::StreamsState::{max_data, data_sent, unacked_data}
> quinn_proto::connection::streams::Send::{max_data, pending.offset()}
>
> Receive side:
>
> quinn_proto::connection::streams::StreamsState::{local_max_data, sent_max_data}
> quinn_proto::connection::streams::Recv::{end, sent_stream_max_data, assembler.bytes_read()}

szguoxz (Author) commented May 18, 2024

I did a test. I am building a VPN, sending packets through QUIC.
Using a bidirectional stream with length-delimited framing is much more stable than using unidirectional streams with one stream per frame.
I believe the default TransportConfig is to blame. For example, I need to adjust max_concurrent_uni_streams; 100 is way too low. But even when I change it to 1000, it's still not stable.
The bidirectional stream is much more stable.

It still hangs from time to time, and I still can't figure out why. But I'm pretty sure it's because the data somehow can't be sent. Not only can it not be sent, it also blocks the flow, i.e. write_all().await gets stuck.

Very tough to reproduce; I will continue to watch.
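A rough sketch of the bidirectional-stream-plus-framing approach described above (assuming quinn 0.11 and tokio-util's LengthDelimitedCodec; next_packet_from_tun is a hypothetical packet source, not the reporter's code):

```rust
// One long-lived bidirectional stream carrying length-delimited frames.
// Requires tokio-util with the "codec" feature and the futures crate.
use futures::SinkExt;
use tokio_util::codec::{FramedWrite, LengthDelimitedCodec};

async fn run_tunnel(conn: quinn::Connection) -> anyhow::Result<()> {
    let (send, _recv) = conn.open_bi().await?;
    // quinn::SendStream implements tokio's AsyncWrite, so it can back a FramedWrite.
    let mut framed = FramedWrite::new(send, LengthDelimitedCodec::new());
    loop {
        let packet: Vec<u8> = next_packet_from_tun().await?; // hypothetical packet source
        framed.send(packet.into()).await?; // prefixes each frame with its length
    }
}
```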

Ralith (Collaborator) commented May 18, 2024

> It seems this information isn't publicly available?

Yes, they are internal Quinn state. You can use a modified version of Quinn to insert whatever logging or getters you like.
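As a sketch, one way to run such a modified copy is Cargo's [patch] mechanism (the path is illustrative):

```toml
# Application's Cargo.toml: use a local checkout of quinn-proto with extra logging added.
[patch.crates-io]
quinn-proto = { path = "../quinn/quinn-proto" }
```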

Ralith (Collaborator) commented May 18, 2024

> Using a bidirectional stream with length-delimited framing is much more stable than using unidirectional streams with one stream per frame.

If using short-lived streams fails much more often, can you build a test case using that pattern? If you're observing the same behavior when using short-lived streams, it is much less likely to be a flow control issue.

> I believe the default TransportConfig is to blame. For example, I need to adjust max_concurrent_uni_streams; 100 is way too low. But even when I change it to 1000, it's still not stable.

That parameter governs concurrency. It will not cause your application to hang unless your application is incorrect. In most cases, you should be able to set it to 1 and have no adverse effects beyond degraded throughput.
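To illustrate "unless your application is incorrect": a sketch under quinn 0.11's API, where awaiting open_uni() turns the concurrency limit into backpressure rather than a hang (the helper name and error handling are illustrative):

```rust
// One frame per unidirectional stream while respecting the peer's stream limit.
// open_uni() resolves only once a new stream may be opened, so an exhausted
// max_concurrent_uni_streams limit delays this call instead of breaking anything.
async fn send_frame(conn: &quinn::Connection, frame: &[u8]) -> anyhow::Result<()> {
    let mut stream = conn.open_uni().await?; // waits while the concurrency limit is exhausted
    stream.write_all(frame).await?;
    stream.finish()?; // mark the stream complete so the receiver's stream count is released
    Ok(())
}
```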

szguoxz (Author) commented May 18, 2024

Is there a way to require the stream to send data and receive an ACK within a certain time frame? If it doesn't get the ACK back in time, the stream would invalidate the connection.

I think what I'm looking for is an "ACK timeout" setting on the transport config. Is that possible?


Ralith (Collaborator) commented May 18, 2024

The health of a connection is independent of the state of an individual stream. If a connection is healthy, then so are its streams. If a peer stops responding, the connection will time out according to the idle timeout.
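For reference, the closest existing knobs are the idle timeout and keep-alive interval on TransportConfig; a sketch assuming quinn 0.11, with illustrative values:

```rust
use std::{sync::Arc, time::Duration};

// Illustrative values: time out a connection whose peer stops responding for 10 s,
// and send keep-alives every 3 s so an idle-but-healthy connection is not dropped.
fn transport_config() -> anyhow::Result<Arc<quinn::TransportConfig>> {
    let mut cfg = quinn::TransportConfig::default();
    cfg.max_idle_timeout(Some(Duration::from_secs(10).try_into()?));
    cfg.keep_alive_interval(Some(Duration::from_secs(3)));
    Ok(Arc::new(cfg))
}
```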

szguoxz closed this as completed May 21, 2024
djc closed this as not planned May 21, 2024
Ralith (Collaborator) commented May 21, 2024

Did you root-cause your issue? Is there something we could document better to avoid similar issues in the future?
