Requests randomly fail to receive data #6806

TimOrme · 2021-04-28T18:34:29Z

TimOrme
Apr 28, 2021

Example

I'm having a very hard time reproducing this issue consistently, but have exhausted all other avenues that I could think of. I'll do my best to describe the setup here, but unfortunately I couldn't come up with a code sample that I was able to reproduce it consistently.

The gist is like this however:

import functools
from fastapi import FastAPI, Depends, HTTPException
from pydantic import BaseModel
from sqlalchemy import create_engine
from sqlalchemy.engine import Engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import Session, sessionmaker
import ddtrace

app = FastAPI()

@functools.lru_cache(maxsize=1)
def get_database_engine() -> Engine:
    """Retrieve the database engine.  There should only be one of these per application."""
    engine = create_engine("postgres://")
    return engine


@functools.lru_cache(maxsize=1)
def get_database_session(engine: Engine = Depends(get_database_engine)) -> sessionmaker:
    """Retrieve the database session maker.  There should only be one of these per application."""
    return sessionmaker(autocommit=True, autoflush=False, bind=engine)


def get_db(session_local: sessionmaker = Depends(get_database_session)):
    """Get a database connection."""
    with ddtrace.tracer.trace("db_connection_acquire"):
        db = session_local()
        try:
            yield db
        finally:
            db.close()

class MappingRequest(BaseModel):
    """Request class to create new mappings."""

    original_id: str
    new_id: str

@app.post("/mapping")
def create_mapping(
    upload: MappingRequest, db: Session = Depends(get_db)
):
    """Create a new dicom deid mapping for a series."""
    with ddtrace.tracer.trace("mapping_start"):
        try:
            with db.begin():
                with ddtrace.tracer.trace("mapping_transaction"):
                    mapping = crud.create_mapping(
                        db, upload.original_id, upload.new_id
                    )
                    return {"mapping_id": mapping.id}
        except IntegrityError:
            raise HTTPException(status_code=409, detail="Mapping exists")

We have a simple sync endpoint which inserts keys into the database. The DB operation is very quick. We've also added span traces to debug the behavior using datadog.

Description

Under moderate load, about 10 requests per second, the endpoint mostly responds very fast in the 10ms range. However, sometimes there are extreme outliers in response time, where the request goes to 3 minutes before the connection is killed by our ALB.

Traces during the long operation show that the route code is never hit, nor are the dependencies. This seems to indicate that something within FastAPI is failing to properly schedule or submit the sync jobs into the thread pool [URL details obfuscated to remove company info]

Compared with normal requests:

This seems to be a similar issue to the one described here: https://stackoverflow.com/questions/61466243/uvicorn-not-processing-some-requests-randamly

And similar behavior here: #1195 but we are not on windows.

Environment

OS: [e.g. Linux / Windows / macOS]: docker with tiangolo/uvicorn-gunicorn-fastapi-docker:python:3.8, host is Linux.
FastAPI Version [e.g. 0.3.0]: 0.63.0
Python version: 3.8.2
Gunicorn version: 20.1.0
Uvicorn version: 0.13.4

Note that gunicorn and uvicorn were upgraded manually in the dockerfile in an attempt to resolve the issue. However, the default versions with the image exhibited the same behavior.

Additional context

We've tried a number of additional ways to reproduce this, but have been unsuccessful. When testing locally, even introducing 10x the load, we can't reproduce this issue.

Also should note this is running in AWS in ECS, behind an ALB. We have toyed with the timeout settings within gunicorn and uvicorn to try and address this as well, but none of those seem to solve it.

TimOrme · 2021-04-30T00:43:51Z

TimOrme
Apr 30, 2021
Author

Some additional details here. We added some middleware to see if we could further debug where it was hanging:

"""Custom asgi middleware."""
import ddtrace


class TraceSendReceiveAsgiMiddleware:
    """Middleware to trace each message sent and received in a connection."""

    def __init__(self, app):
        """Initialize the middleware."""
        self.app = app

    async def __call__(self, scope, receive, send):
        """Invoke the middleware."""

        async def wrapped_receive():
            with ddtrace.tracer.trace("receive"):
                return await receive()

        async def wrapped_send(message):
            with ddtrace.tracer.trace("send"):
                return await send(message)

        return await self.app(scope, wrapped_receive, wrapped_send)

With this, we see that for these cases, it does seem to be stuck on the await receive() portion.

I also attempted to fully convert to asynchronous endpoints, upgraded to SQLAlchemy 1.4, leveraged its new async support, and moved the endpoints themselves to be async as well. However, I see the same behavior even with the async endpoints. I'll update the ticket title to reflect this, as it doesn't seem isolated to sync endpoints.

Also should note that the current client for this is using httpx. Initially though that perhaps the client was simply slow, but ensured that client timeouts are set, and verified that they are at 5 seconds. It would appear then that the server is stuck waiting for data even after the client has presumably dropped the request.

0 replies

TimOrme · 2021-04-30T18:45:15Z

TimOrme
Apr 30, 2021
Author

Also tried to remove gunicorn/uvicorn from the equation, and replaced them with hypercorn. The behavior is similar, though with some interesting differences. First, it didn't fail after the 3 minute mark, but instead took over 11 minutes to complete, presumably because hypercorn doesn't axe the request on connection close.

It also doesn't give a return code however, and I actually see an error in the logs, which may be helpful:


2021-04-30T18:03:55.103Z | RuntimeError: cannot reuse already awaited coroutine
-- | --
2021-04-30T18:03:55.103Z | message = await self._receive()
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/site-packages/starlette/requests.py", line 204, in stream
2021-04-30T18:03:55.103Z | Traceback (most recent call last):
2021-04-30T18:03:55.103Z | future: <Task finished coro=<<async_generator_athrow without __name__>()> exception=RuntimeError('cannot reuse already awaited coroutine')>
2021-04-30T18:03:55.103Z | ERROR:asyncio:Task exception was never retrieved
2021-04-30T18:03:55.103Z | RuntimeError: Cannot call write() after write_eof()
2021-04-30T18:03:55.103Z | File "uvloop/handles/stream.pyx", line 676, in uvloop.loop.UVStream.write
2021-04-30T18:03:55.103Z | self._transport.write(data)
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/asyncio/streams.py", line 305, in write
2021-04-30T18:03:55.103Z | self.writer.write(event.data)
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/tcp_server.py", line 89, in protocol_send
2021-04-30T18:03:55.103Z | await self.send(RawData(data=data))
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/h11.py", line 229, in _send_h11_event
2021-04-30T18:03:55.103Z | status_code=event.status_code,
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/h11.py", line 117, in stream_send
2021-04-30T18:03:55.103Z | status_code=status_code,
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/http_stream.py", line 168, in _send_error_response
2021-04-30T18:03:55.103Z | await self._send_error_response(500)
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/site-packages/hypercorn/protocol/http_stream.py", line 104, in app_send
2021-04-30T18:03:55.103Z | await send(None)
2021-04-30T18:03:55.103Z | File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/context.py", line 45, in _handle
2021-04-30T18:03:55.103Z | Traceback (most recent call last):
2021-04-30T18:03:55.103Z | Exception ignored in: <coroutine object _handle at 0x7fe56822bf80>
2021-04-30T18:03:55.099Z | task: <Task pending coro=<_handle() running at /usr/local/lib/python3.7/site-packages/hypercorn/asyncio/context.py:39> wait_for=<Future cancelled> cb=[gather.<locals>._done_callback() at /usr/local/lib/python3.7/asyncio/tasks.py:691]>
2021-04-30T18:03:55.099Z | ERROR:asyncio:Task was destroyed but it is pending!
2021-04-30T18:03:55.099Z | RuntimeError: coroutine ignored GeneratorExit
2021-04-30T18:03:55.099Z | await TCPServer(app, loop, config, reader, writer)
2021-04-30T18:03:55.099Z | File "/usr/local/lib/python3.7/site-packages/hypercorn/asyncio/run.py", line 103, in _server_callback
2021-04-30T18:03:55.099Z | Traceback (most recent call last):
2021-04-30T18:03:55.099Z | Exception ignored in: <coroutine object worker_serve.<locals>._server_callback at 0x7fe5683c33b0>
2021-04-30T18:03:55.099Z | task: <Task pending coro=<worker_serve.<locals>._server_callback() running at /usr/local/lib/python3.7/site-packages/hypercorn/asyncio/run.py:103> wait_for=<_GatheringFuture pending cb=[<TaskWakeupMethWrapper object at 0x7fe568153550>()]>>
2021-04-30T18:03:55.099Z | ERROR:asyncio:Task was destroyed but it is pending!

0 replies

xpayn · 2021-05-06T15:38:08Z

xpayn
May 6, 2021

I'm having a similar issue.
It get stuck in a middleware when calling down the middleware stack, but it never enter the route handler.
My environment is an AWS lambda function called from api gateway, so for the moment I used logs to narrow the problem.
I'll try to reproduce it on my machine and try to use pdb to see where it's stuck.
It feels like some kind of deadlock.

0 replies

TimOrme · 2021-05-06T21:40:20Z

TimOrme
May 6, 2021
Author

I think I've figured this out, though not sure if this is really a problem with FastAPI or with httpx, which is the client I'm using. It appears that if a httpx client with keep-alive connections is running, and it has a timeout error, it doesn't complete the request. As such, FastAPI just waits patiently for more data to come in, even though the client request is dead. FastAPI doesn't notice that the client request is done until the connection itself is closed.

In short, the requests themselves aren't actually taking this long, it's just that the client has bailed, and FastAPI just keeps waiting.

I have a (mostly) reproducible example now:

DockerFile:

FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

COPY ./app /app

/app/main.py:

from fastapi import FastAPI
from pydantic import BaseModel
import time



class TimingMiddleware:
    """Middleware to track request time."""

    def __init__(self, app):
        """Initialize the middleware."""
        self.app = app

    async def __call__(self, scope, receive, send):
        """Invoke the middleware."""

        async def wrapped_receive():
            start = time.time()
            data = await receive()
            end = time.time()
            print(f"RECEIVED IN {end-start}")
            return data

        return await self.app(scope, wrapped_receive, send)


app = FastAPI()
app.add_middleware(TimingMiddleware)


class Item(BaseModel):
    data: str


@app.post("/endpoint")
def endpoint(item: Item):
    return {"Hello": "World"}

test_client.py:

import asyncio
import httpx
import time


async def make_request(client):
    # post a big chunk of data
    await client.post("http://localhost/endpoint", json={"data": "A" * 10000000})
    return "XXX"


async def main():
    async with httpx.AsyncClient(timeout=httpx.Timeout(read=5, connect=5, pool=5, write=0.00000000000000000000000000000000000000001)) as client:
        tasks = []
        for x in range(5):
            tasks.append(asyncio.create_task(make_request(client)))

        # Gather all tasks
        done = await asyncio.gather(*tasks, return_exceptions=True)
        print(done)  # Should always print WriteTimeout.

        # Just sleep for 10 seconds.  Connections wont be released until the end of this.
        time.sleep(10)

asyncio.run(main())

Start the FastAPI server.

docker build -t timeout .
docker run --rm --name timeout -p 80:80 timeout

Run the test client:

python test_client.py

You'll observe that test_client.py immediately issues timeouts:

%python test_client.py
[WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout(''), WriteTimeout('')]

But that FastAPI doesn't register any output until 10 seconds later, when the client closes the connections. If you extend the wait time to 3 minutes, the same behavior occurs.

I'm not sure what the expected behavior really should be. It does seem that httpx could potentially be better about notifying the server that nothing else is coming, but also seems that something is potentially awry with how keep-alives are working with gunicorn/uvicorn/fastapi. Ideally, the server should be configurable to time out more proactively on these dead connections.

A simple workaround, in my case at least, was to avoid keeping long-lived httpx clients, and instead create a new client and connection for each request. This is probably worse from a performance perspective, but it does prevent these sort of confusing timings.

0 replies

xpayn · 2021-05-07T07:16:23Z

xpayn
May 7, 2021

In my case, I use httpx but to call another webservice from my fastapi app, my clients are browser and it sometimes hangs (indefinitely for FF and canceled after 4s by chrome).
My runtime environment is looking strangely like yours (a lambda is a container), so there's maybe something to dig there.
I'll try to run my code in docker and see if it hangs with a browser.

0 replies

magic-element-k · 2021-05-11T05:00:03Z

magic-element-k
May 11, 2021

@TimOrme I also had a problem with blocking requests. Look here for solutions to this problem
#3205
tiangolo/full-stack-fastapi-template#104

0 replies

tiangolo · 2022-09-04T19:54:34Z

tiangolo
Sep 4, 2022
Maintainer

Thanks for the report and discussion! ☕

This was possibly solved in #5122, released as part of FastAPI 0.82.0 🎉

Could you check it? If this solves it for you, you could close the issue. 🤓

0 replies

2022-12-09T14:14:30Z

github-actions[bot]
bot Dec 9, 2022

Assuming the original need was handled, this will be automatically closed now. But feel free to add more comments or create new issues or PRs.

0 replies

adarshpunj · 2023-02-02T11:50:23Z

adarshpunj
Feb 2, 2023

@tiangolo as of 0.82.0 it doesn't seem to be resolved. I'm having a hard time reproducing this, since it happens randomly:

Environment

OS: Docker image of python 3.8 on a Linux host
FastAPI v0.82.0
uvicorn v0.20.0
pg8000 v1.29.4
nginx/1.18.0 (Ubuntu)

As @TimOrme has mentioned this isn't a load issue — the requests never reach the server.

0 replies

arnavmehta7 · 2023-05-23T15:49:05Z

arnavmehta7
May 23, 2023

Same issue 😭

0 replies

cEhlen · 2023-06-12T15:16:41Z

cEhlen
Jun 12, 2023

I think I have the same issue. But only in my production environment.

It just hangs and I never get a response. Also happening in the browser.

Looking at the logs with TRACE level in uvicorn:

Jun 12 14:42:01 uvicorn: TRACE: ASGI [14] Started scope={'type': 'http', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.0', 'server': ('127.0.0.1', 8000), 'client': ('ip...', 0), 'scheme': 'https', 'method': 'GET', ...}}
Jun 12 14:57:41 uvicorn: TRACE: ASGI [7] Send {'type': 'http.response.start', 'status': 200, 'headers': '<...>'}

Here there is a 15m 40s delay between receiving the request and handling it.

This is happening on FastAPI v0.95.2 and Python 3.11.4.

Not sure how to solve this. Tried debugging with GDB but it just hangs waiting for the futex...

Anyone got any ideas what to check next?

1 reply

cEhlen Jun 23, 2023

It turns out the problem was a firewall dropping the database connection after 30 minutes of inactivity.
Not sure where the 15m 40s timeout was coming from.

Setting pool_recycle for the sqlalchemy engine to 25 minutes fixed the issue.

korabs-x · 2024-05-10T14:01:15Z

korabs-x
May 10, 2024

We also had the same issue in our production environment.
The reason was occasional packet loss in our GKE Autopilot cluster during times of high load.
During debugging, we observed this:

The request comes in, and reaches e.g. a custom middleware
When fastapi/starlette waits for the body, it's stuck in the occasional case when the body packet(s) are lost

As a workaround, we added an override for the stream function of starlette's _CachedRequest, which raises a timeout when an expected body chunk takes more than 1s to arrive.

Please note: This workaround only helps to "fail faster". It helps because we can retry directly again and get to the successful response much quicker, but isn't an ultimate solution. This was helpful for us as a temporary workaround, however the proper way to handle the error is of course to fix the underlying issue of packet loss. In addition, it only handles the request direction, we unfortunately didn't find a good way to detect packet loss in the response direction.

Here's the code we added for faster timeouts when receiving request data:

from starlette.middleware import base

RECEIVE_REQUEST_TIMEOUT = 1

@app.on_event("startup")
async def _startup():
    # Add timeout for receiving request bodys
    # starlette.requests.Request is used for both incoming and outgoing requests
    # starlette.middleware.base._CachedRequest is used only for the incoming request (is subclass of Request)
    stream_fn = base._CachedRequest.stream

    async def stream_with_timeout(self, *args, **kwargs):
        stream = stream_fn(self, *args, **kwargs)
        while True:
            try:
                # Fetch next item with timeout
                next_item = asyncio.wait_for(stream.__anext__(), RECEIVE_REQUEST_TIMEOUT)
                yield await next_item
            except asyncio.TimeoutError:
                logging.warn(f"Receive request timeout: {self.url}")
                raise HTTPException(status_code=408, detail="Request timeout")
            except StopAsyncIteration:
                break  # No more items in stream

    base._CachedRequest.stream = stream_with_timeout

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Requests randomly fail to receive data #6806

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 12 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Requests randomly fail to receive data #6806

Example

Description

Environment

Additional context

Replies: 12 comments · 1 reply

TimOrme Apr 30, 2021 Author

TimOrme Apr 30, 2021 Author

TimOrme May 6, 2021 Author

tiangolo Sep 4, 2022 Maintainer

github-actions[bot] bot Dec 9, 2022

Replies: 12 comments 1 reply

TimOrme
Apr 30, 2021
Author

TimOrme
Apr 30, 2021
Author

TimOrme
May 6, 2021
Author

tiangolo
Sep 4, 2022
Maintainer

github-actions[bot]
bot Dec 9, 2022