Replies: 12 comments 1 reply
-
Some additional details here. We added some middleware to see if we could further debug where it was hanging:

```python
"""Custom ASGI middleware."""
import ddtrace


class TraceSendReceiveAsgiMiddleware:
    """Middleware to trace each message sent and received in a connection."""

    def __init__(self, app):
        """Initialize the middleware."""
        self.app = app

    async def __call__(self, scope, receive, send):
        """Invoke the middleware."""

        async def wrapped_receive():
            with ddtrace.tracer.trace("receive"):
                return await receive()

        async def wrapped_send(message):
            with ddtrace.tracer.trace("send"):
                return await send(message)

        return await self.app(scope, wrapped_receive, wrapped_send)
```

With this, we see that for these cases, it does seem to be stuck on the `receive`.

I also attempted to fully convert to asynchronous endpoints: I upgraded to SQLAlchemy 1.4, leveraged its new async support, and moved the endpoints themselves to be async as well. However, I see the same behavior even with the async endpoints. I'll update the ticket title to reflect this, as it doesn't seem isolated to sync endpoints. Also worth noting that the current client for this is using httpx.
-
Also tried to remove gunicorn/uvicorn from the equation, replacing them with hypercorn. The behavior is similar, though with some interesting differences. First, it didn't fail at the 3 minute mark, but instead took over 11 minutes to complete, presumably because hypercorn doesn't axe the request on connection close. It also doesn't return a status code, however, and I actually see an error in the logs, which may be helpful:
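(The exact hypercorn invocation wasn't captured above; a run along these lines, using hypercorn's standard CLI, would mirror the container setup:)

```shell
# Serve the same app directly with hypercorn instead of gunicorn/uvicorn.
hypercorn main:app --bind 0.0.0.0:80
```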
-
I'm having a similar issue.
-
I think I've figured this out, though not sure if this is really a problem with FastAPI or with httpx. In short, the requests themselves aren't actually taking this long; it's just that the client has bailed, and FastAPI just keeps waiting. I have a (mostly) reproducible example now.

Dockerfile:

/app/main.py:

```python
import time

from fastapi import FastAPI
from pydantic import BaseModel


class TimingMiddleware:
    """Middleware to track request time."""

    def __init__(self, app):
        """Initialize the middleware."""
        self.app = app

    async def __call__(self, scope, receive, send):
        """Invoke the middleware."""

        async def wrapped_receive():
            start = time.time()
            data = await receive()
            end = time.time()
            print(f"RECEIVED IN {end - start}")
            return data

        return await self.app(scope, wrapped_receive, send)


app = FastAPI()
app.add_middleware(TimingMiddleware)


class Item(BaseModel):
    data: str


@app.post("/endpoint")
def endpoint(item: Item):
    return {"Hello": "World"}
```

test_client.py:

```python
import asyncio
import time

import httpx


async def make_request(client):
    # Post a big chunk of data.
    await client.post("http://localhost/endpoint", json={"data": "A" * 10000000})
    return "XXX"


async def main():
    # Absurdly small write timeout, to force WriteTimeout errors.
    timeout = httpx.Timeout(read=5, connect=5, pool=5, write=0.00000000000000000000000000000000000000001)
    async with httpx.AsyncClient(timeout=timeout) as client:
        tasks = []
        for x in range(5):
            tasks.append(asyncio.create_task(make_request(client)))
        # Gather all tasks.
        done = await asyncio.gather(*tasks, return_exceptions=True)
        print(done)  # Should always print WriteTimeout.
        # Just sleep for 10 seconds. Connections won't be released until the end of this.
        time.sleep(10)


asyncio.run(main())
```

Start the FastAPI server:

```shell
docker build -t timeout .
docker run --rm --name timeout -p 80:80 timeout
```

Run the test client:

```shell
python test_client.py
```

You'll observe that the client fails fast:

```
% python test_client.py
[WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout('')]
```

But FastAPI doesn't register any output until 10 seconds later, when the client closes the connections. If you extend the wait time to 3 minutes, the same behavior occurs.

I'm not sure what the expected behavior really should be. It does seem that httpx could be better about notifying the server that nothing else is coming, but it also seems that something is awry with how keep-alives are working with gunicorn/uvicorn/FastAPI. Ideally, the server should be configurable to time out more proactively on these dead connections.

A simple workaround, in my case at least, was to avoid keeping long-lived httpx clients, and instead create a new client and connection for each request. This is probably worse from a performance perspective, but it does prevent these sorts of confusing timings.
-
In my case, I use httpx to call another webservice from my FastAPI app; my clients are browsers, and it sometimes hangs (indefinitely for Firefox, canceled after 4s by Chrome).
-
@TimOrme I also had a problem with blocking requests. Look here for solutions to this problem.
-
Thanks for the report and discussion! ☕ This was possibly solved in #5122, released as part of FastAPI 0.82.0 🎉 Could you check it? If this solves it for you, you could close the issue. 🤓
-
Assuming the original need was handled, this will be automatically closed now. But feel free to add more comments or create new issues or PRs.
-
@tiangolo as of 0.82.0 it doesn't seem to be resolved. I'm having a hard time reproducing this, since it happens randomly.

Environment

As @TimOrme has mentioned, this isn't a load issue — the requests never reach the server.
-
Same issue 😭
-
I think I have the same issue, but only in my production environment. It just hangs and I never get a response. It's also happening in the browser. Looking at the logs with TRACE level in uvicorn:

Here there is a 15m 40s delay between receiving the request and handling it. This is happening on FastAPI v0.95.2 and Python 3.11.4. Not sure how to solve this. I tried debugging with GDB, but it just hangs waiting on the futex... Anyone got any ideas what to check next?
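For anyone else trying to capture the same logs, trace logging can be enabled from uvicorn's CLI (values here are illustrative; `--log-level trace` and `--timeout-keep-alive` are documented uvicorn options):

```shell
# --log-level trace logs every ASGI message and connection event;
# --timeout-keep-alive closes idle keep-alive connections after N seconds.
uvicorn main:app --log-level trace --timeout-keep-alive 5
```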
-
We also had the same issue in our production environment.

As a workaround, we added an override for the

Please note: this workaround only helps to "fail faster". It helps because we can retry directly again and get to the successful response much quicker, but it isn't an ultimate solution. This was helpful for us as a temporary workaround; the proper way to handle the error is of course to fix the underlying issue of packet loss. In addition, it only handles the request direction; we unfortunately didn't find a good way to detect packet loss in the response direction.

Here's the code we added for faster timeouts when receiving request data:
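(The poster's actual snippet wasn't captured above. The general shape of such an override is a small ASGI wrapper that bounds each `receive()` call; the class name and timeout value below are hypothetical, not the code referenced in the comment:)

```python
import asyncio


class ReceiveTimeoutMiddleware:
    """Fail fast when the client stops sending request body data."""

    def __init__(self, app, timeout: float = 10.0):
        self.app = app
        self.timeout = timeout

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)

        async def timed_receive():
            try:
                return await asyncio.wait_for(receive(), self.timeout)
            except asyncio.TimeoutError:
                # Surface the stalled upload as a disconnect so the app
                # aborts instead of waiting for the load balancer timeout.
                return {"type": "http.disconnect"}

        return await self.app(scope, timed_receive, send)
```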
-
Example
I'm having a very hard time reproducing this issue consistently, but I've exhausted all other avenues I could think of. I'll do my best to describe the setup here, but unfortunately I couldn't come up with a code sample that reproduces it consistently.
The gist is like this however:
We have a simple sync endpoint which inserts keys into the database. The DB operation is very quick. We've also added span traces to debug the behavior using datadog.
Description
Under moderate load, about 10 requests per second, the endpoint mostly responds very fast, in the 10ms range. However, sometimes there are extreme outliers in response time, where the request runs for 3 minutes before the connection is killed by our ALB.
Traces during the long operation show that the route code is never hit, nor are the dependencies. This seems to indicate that something within FastAPI is failing to properly schedule or submit the sync jobs into the thread pool. [URL details obfuscated to remove company info]
Compared with normal requests:
This seems to be a similar issue to the one described here: https://stackoverflow.com/questions/61466243/uvicorn-not-processing-some-requests-randamly
And similar behavior here: #1195, but we are not on Windows.
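For context on the "route code is never hit" observation: a sync endpoint is only dispatched to a worker thread once the request is fully ready. The sketch below is a stdlib illustration of that dispatch pattern, not FastAPI's actual code (which goes through Starlette/AnyIO); the point is that if the request body never finishes arriving, this hand-off never happens, matching the empty traces.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def sync_endpoint(item: dict) -> dict:
    # Stand-in for the blocking route handler.
    return {"Hello": "World"}


async def dispatch(item: dict) -> dict:
    # The event loop hands the sync handler to a worker thread and awaits
    # the result; this only runs after the request has been received and
    # parsed, which is exactly the step that stalls in the traces above.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        return await loop.run_in_executor(pool, sync_endpoint, item)
```

Running `asyncio.run(dispatch({"data": "A"}))` returns `{"Hello": "World"}` without blocking the event loop while the handler executes.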
Environment
Note that gunicorn and uvicorn were upgraded manually in the Dockerfile in an attempt to resolve the issue. However, the default versions in the image exhibited the same behavior.
Additional context
We've tried a number of additional ways to reproduce this, but have been unsuccessful. When testing locally, even introducing 10x the load, we can't reproduce this issue.
I should also note this is running in AWS ECS, behind an ALB. We have toyed with the timeout settings in gunicorn and uvicorn to try and address this as well, but none of those seem to solve it.
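The knobs we experimented with look roughly like this (values are illustrative; `--timeout`, `--graceful-timeout`, and `--keep-alive` are gunicorn's documented settings, though with async workers `--timeout` acts as a worker heartbeat rather than a per-request limit):

```shell
gunicorn main:app \
  -k uvicorn.workers.UvicornWorker \
  --timeout 60 \
  --graceful-timeout 30 \
  --keep-alive 5
```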